Abstract

High density clusters can be characterized by the connected components of a level set $L(\lambda) = \{x :
p(x) > \lambda\}$ of the underlying probability density function $p$ generating the data, at some appropriate level
$\lambda > 0$. The complete hierarchical clustering can be characterized by a cluster tree $T = \bigcup_{\lambda} L(\lambda)$. In this
paper, we study the behavior of a density level set estimate $\hat{L}(\lambda)$ and cluster tree estimate $\hat{T}$ based on
a kernel density estimator with kernel bandwidth $h$. We define two notions of instability to measure the
variability of $\hat{L}(\lambda)$ and $\hat{T}$ as a function of $h$, and investigate the theoretical properties of these instability
measures.
1 Introduction

A common approach to identifying high density clusters is based on using level sets of the density function
(Hartigan (1975), Rigollet and Vert (2009)). Let $X_1, \ldots, X_n$ be a random sample from a distribution $P$
on $\mathbb{R}^d$ with density $p$. For $\lambda > 0$ define the level set $L(\lambda) = \{x : p(x) > \lambda\}$. Assume that $L(\lambda)$ can be
decomposed into disjoint, connected sets: $L(\lambda) = \bigcup_{j=1}^{N_\lambda} C_j$. We refer to $\mathcal{C}_\lambda = \{C_1, \ldots, C_{N_\lambda}\}$ as the density
clusters at level $\lambda$. We call the collection of clusters

$$T = \bigcup_{\lambda > 0} \mathcal{C}_\lambda$$ (1)

the cluster tree. Note that $T$ does indeed have a tree structure: if $A, B \in T$ then either $A \subset B$, or $B \subset A$, or
$A \cap B = \emptyset$. The tree summarizes the cluster structure of the distribution; see Stuetzle and Nugent (2009).

It is also possible to index the level sets by probability content. For $0 < \alpha < 1$, define the level set
$M(\alpha) = L(\lambda_\alpha)$ where

$$\lambda_\alpha = \sup\{\lambda : P(L(\lambda)) > \alpha\}.$$ (2)
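
As an illustration only, the level $\lambda_\alpha$ in (2) can be approximated numerically when the density is known on a grid. The sketch below is not part of the original analysis: the function names, the grid and the choice of a standard normal density are assumptions made for the example. It sorts the grid cells by decreasing density and accumulates their mass until the probability content first exceeds $\alpha$.

```python
import math


def normal_pdf(x):
    """Density of the standard normal distribution (example density, assumed)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def level_for_content(densities, cell_width, alpha):
    """Approximate sup{lam : P(L(lam)) > alpha} from density values on a regular grid.

    Cells are visited from the highest density downwards; the level returned is the
    density of the cell at which the accumulated mass first exceeds alpha.
    """
    mass = 0.0
    for d in sorted(densities, reverse=True):
        mass += d * cell_width
        if mass > alpha:
            return d
    return 0.0  # the grid does not carry more than alpha of the mass


# Hypothetical grid on [-8, 8] with step 0.001.
grid = [-8.0 + 0.001 * i for i in range(16001)]
dens = [normal_pdf(x) for x in grid]
lam_half = level_for_content(dens, 0.001, 0.5)
print("lambda_alpha for alpha = 0.5:", lam_half)
```

For the standard normal density the level set with probability content 1/2 is the central interval between the quartiles, so the value returned should be close to the density at the upper quartile.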

If the density does not contain any jumps or flat parts, then there is a one-to-one correspondence between
the level sets indexed by the density level and the probability content. The cluster tree obtained from the
clusters of $M(\alpha)$ for $0 < \alpha < 1$ is equivalent to $T$. Relabeling the tree in terms of $\alpha$ may be convenient
because $\alpha$ is more interpretable than $\lambda$, but the tree is the same. Figure 1 shows the cluster tree for a density
estimate of a mixture of three normals (using a reference rule bandwidth). The cluster tree's two splits
and subsequent three leaves correspond to the density estimate's modes. The tree is also indexed by $\lambda$, the
density estimate's height, on the left and $\alpha$, the probability content, on the right. For example, the second
split corresponds to $\lambda = 0.086$ and $\alpha = 0.257$. We note here that determining the true clusters for even this
seemingly simple univariate distribution is not trivial for all $\lambda$; in particular, values of $\lambda$ near 0.04 and 0.09
will give ambiguous results.
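
To make the notion of density clusters concrete, the following Python sketch computes the connected components of a level set for a univariate density evaluated on a grid. It is a minimal illustration under stated assumptions: the function names and the grid are hypothetical, and the density used is the true three-component normal mixture of Figure 1 rather than the density estimate shown in that figure.

```python
import math


def normal_pdf(x, mean):
    """Density of N(mean, 1) at x."""
    z = x - mean
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def mixture_pdf(x):
    """(4/7) N(0, 1) + (2/7) N(3.5, 1) + (1/7) N(7, 1)."""
    return (4 / 7) * normal_pdf(x, 0.0) + (2 / 7) * normal_pdf(x, 3.5) + (1 / 7) * normal_pdf(x, 7.0)


def level_set_clusters(grid, density, lam):
    """Connected components of {x : density(x) > lam} on a sorted 1-d grid.

    Each component is reported as the (left, right) pair of its extreme grid points.
    """
    clusters = []
    start = None
    last = None
    for x in grid:
        if density(x) > lam:
            if start is None:
                start = x  # a new component begins here
            last = x
        elif start is not None:
            clusters.append((start, last))  # the component just ended
            start = None
    if start is not None:
        clusters.append((start, last))
    return clusters


# Hypothetical grid on [-5, 12] with step 0.01.
grid = [-5.0 + 0.01 * i for i in range(1701)]
for lam in (0.02, 0.05, 0.06, 0.09):
    print(lam, level_set_clusters(grid, mixture_pdf, lam))
```

Running the sketch over a range of levels traces out the cluster tree: components at a higher level are nested inside components at a lower level, and a split appears whenever the level passes a local minimum of the density.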

In this paper we study some properties of clusters defined by density level sets and cluster trees. In
particular, we consider their estimators based on a kernel density estimate and show how the bandwidth $h$
of the kernel affects the risk of these estimators. Then we investigate the notion of stability for density-based
clustering. Specifically, we propose two measures of instability. The first, denoted by $\Xi_n(h)$, measures the
instability of a given level set. The second, denoted by $\Gamma_n(h)$, is a more global measure of instability.

Investigation of the stability properties of density clusters is the main focus of the paper. Stability has 
become an increasingly popular tool for choosing tuning parameters in clustering; see von Luxburg (2009), 
Lange et al. (2004), Ben-David et al. (2006), Ben-Hur et al. (2002), Carlsson and Mémoli (2010),
Meinshausen and Bühlmann (2010), Fischer and Buhmann (2003), and Rinaldo and Wasserman (2010). The
basic idea is this: clustering procedures inevitably depend on one or more tuning parameters. If we choose
a good value of the tuning parameter, then we expect that the clusters from different subsets of the data
should be similar. While this idea sounds simple, the reality is rather complex. Figure 2 shows a plot of $\Xi_n$
and $\Gamma_n$ for our example. We see that $\Xi_n(h)$ is a complicated function of $h$ while $\Gamma_n(h)$ is much simpler. Our
results will explain this behavior.



Figure 1: The cluster tree for a density estimate of a sample from the mixture $(4/7)N(0, 1) + (2/7)N(3.5, 1) +
(1/7)N(7, 1)$; the tree is indexed by both $\lambda$ (left) and $\alpha$ (right).



In Section 2 we introduce notation, state assumptions on the density and discuss the kernel density estimate.
In Section 3 we construct plug-in estimates $\hat{L}(\lambda)$ of the level set $L(\lambda)$, $\hat{T}$ of the cluster tree $T$, and $\hat{M}(\alpha)$
of the level set indexed by probability content $M(\alpha)$. In Section 4 we define and study a notion of the
stability of $\hat{L}(\lambda)$ and extend it to $\hat{T}$. We also consider an alternative version of our results when the level
sets are indexed by probability content. We then describe another notion of stability of cluster trees based
on total variation that leads to a constructive procedure for selecting the kernel bandwidth. In Section 5
we consider some numerical examples. Section 6 contains a discussion of the results and the proofs are in
Section 7. Throughout, we use symbols like $c, c_1, c_2, \ldots, C, C_1, C_2, \ldots$ to denote various positive constants
whose value can change in different expressions.

2 Preliminaries 
2.1 Notation 

For $x \in \mathbb{R}^d$, let $\|x\|$ denote its Euclidean norm. Let $B(x, \epsilon) = \{y : \|x - y\| < \epsilon\} \subset \mathbb{R}^d$ denote a ball centered
at $x$ with radius $\epsilon$. For any set $A \subset \mathbb{R}^d$ and any $\epsilon > 0$, let $A \oplus \epsilon = \bigcup_{x \in A} B(x, \epsilon)$. Let
$v_d = \pi^{d/2} / \Gamma(d/2 + 1)$ denote the volume of the unit ball. The Hausdorff distance between two sets $A$ and $B$ is

$$d_\infty(A, B) = \inf\{\epsilon : A \subset (B \oplus \epsilon) \text{ and } B \subset (A \oplus \epsilon)\}.$$ (3)

Finally, let $A^c$ denote the complement of set $A$ and let

$$A \Delta B = (A \cap B^c) \cup (A^c \cap B)$$

denote the symmetric set difference.
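
For finite point sets, the Hausdorff distance in (3) reduces to the larger of the two directed nearest-neighbor distances, and the symmetric difference is an ordinary set operation. The Python sketch below illustrates both; the function names are hypothetical and the restriction to finite sets is an assumption of the example, since the sets considered in the text are in general infinite.

```python
import math


def directed_distance(A, B):
    """Largest distance from a point of A to its nearest point of B."""
    return max(min(math.dist(a, b) for b in B) for a in A)


def hausdorff(A, B):
    """Hausdorff distance between two finite, non-empty collections of points in R^d."""
    return max(directed_distance(A, B), directed_distance(B, A))


def symmetric_difference(A, B):
    """(A intersect B^c) union (A^c intersect B) for finite sets."""
    return (A - B) | (B - A)


print(hausdorff([(0.0, 0.0)], [(3.0, 4.0)]))
print(hausdorff([(0.0,), (1.0,)], [(0.0,), (1.0,), (3.0,)]))
print(symmetric_difference({1, 2, 3}, {3, 4}))
```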

We will be considering samples of $n$ independent and identically distributed random vectors from an
unknown distribution $P$ on $\mathbb{R}^d$. If $X$ and $Y$ are such samples, we will denote with $\mathbb{P}_{X,Y}$ the probability
measure associated to them and with $\mathbb{E}_{X,Y}$ the corresponding expectation operator. Thus, if $\mathcal{A}$ is an event
depending on $X$ and $Y$, we will write $\mathbb{P}_{X,Y}(\mathcal{A})$ for its probability. Finally, for a sample $X = (X_1, \ldots, X_n)$,
we will denote with $\hat{P}_X$ the empirical measure associated with it; explicitly, for any measurable set $A \subset \mathbb{R}^d$,

$$\hat{P}_X(A) = \frac{1}{n} \sum_{i=1}^{n} I(X_i \in A).$$

Figure 2: Plots of the fixed-$\lambda$ instability $\Xi_n(h)$ (top) for $\lambda = 0.09$ and of the total variation instability $\Gamma_n(h)$
(bottom) for the mixture distribution in Figure 1 as functions of the bandwidth $h$.



2.2 Assumptions 

We will use the following assumptions on the density p: 
(A0) Compact Support - The support $S$ of $p$ is compact.

(A1) Lipschitz Density - Assume that

$$p \in \Sigma(A) \equiv \{p : |p(x) - p(y)| \le A \|x - y\|, \text{ for all } x, y \in S\}$$ (4)

for some $A > 0$.

(A2) Local density regularity - For a given density level of interest $\lambda$, there exist constants $0 < \kappa_1 \le \kappa_2 < \infty$
and $0 < \epsilon_0$ such that, for all $\epsilon < \epsilon_0$,

$$\kappa_1 \epsilon \le P(\{x : |p(x) - \lambda| < \epsilon\}) \le \kappa_2 \epsilon.$$ (5)
It is possible to formulate this condition more generally in terms of a power of $\epsilon$, that is, $\epsilon^\alpha$. But, as argued
in Rinaldo and Wasserman (2010), the above statement typically holds with $\alpha = 1$ for almost all $\lambda$.

Some of the results will only require a subset of these assumptions, and this will be explicitly mentioned
in the statement of the result. Assumptions (A1) and (A2) characterize the density regularity: (A1) implies
that the density cannot change drastically anywhere, while (A2) implies that the density cannot be too flat
or steep locally around the level set. Also notice that (A0) and (A1) together imply that the density $p$ is
bounded by some positive constant $p_{\max} < \infty$. These assumptions are stronger than necessary, but they
simplify the proofs. Notice in particular that assumption (A2) can rule out the case of sharp clusters, in
which $S$ is the disjoint union of a finite number of compact sets over which $p$ is bounded from below by a
positive constant.

2.3 Estimating the Density 

To estimate the density $p$ based on the i.i.d. sample $X = (X_1, \ldots, X_n)$, we use the kernel density estimator

$$\hat{p}_{h,X}(u) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h^d} K\left(\frac{\|u - X_i\|}{h}\right)$$

where $K$ is a symmetric kernel with compact support and $h > 0$ is the bandwidth. Let $p_h(u) = \mathbb{E}_X[\hat{p}_{h,X}(u)]$.
Note that $p_h$ is the density of

$$P_h = P \otimes \mathbb{K}_h$$

where $\otimes$ denotes convolution and $\mathbb{K}_h$ denotes the probability measure of a random variable with density
$K_h(t) = h^{-d} K(\|t\|/h)$.
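
The estimator above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used for the figures: it assumes the spherical kernel $K(t) = I(t \le 1)/v_d$, which is one admissible compactly supported choice, and the function names are hypothetical.

```python
import math


def unit_ball_volume(d):
    """Volume v_d of the unit ball in R^d."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1)


def spherical_kernel(t, d):
    """K(t) = 1 / v_d for 0 <= t <= 1 and 0 otherwise, so K(||z||) integrates to 1 over R^d."""
    return 1.0 / unit_ball_volume(d) if t <= 1.0 else 0.0


def kde(u, sample, h):
    """Kernel density estimate at the point u (a tuple) from a list of sample points."""
    d = len(u)
    total = sum(spherical_kernel(math.dist(u, x) / h, d) for x in sample)
    return total / (len(sample) * h ** d)


sample = [(0.0,), (0.5,), (3.0,)]
print(kde((0.0,), sample, h=1.0))  # two of the three points lie within distance h of u
```

With this kernel the estimate at $u$ is simply the fraction of sample points in $B(u, h)$ divided by the volume $v_d h^d$ of that ball, which is the representation exploited later when the spherical kernel is assumed.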

We impose the following conditions on $p_h$.

(B0) The support of $P_h$ is compact for all $h > 0$.

(B1) $p_h \in \Sigma(A)$.

(B2) For a given density level $\lambda$, there exist positive constants $\kappa_1 \le \kappa_2$, $\epsilon_0$ and $H$ bounded away from 0 and
$\infty$, such that, for all $0 < \epsilon < \epsilon_0$,

$$\kappa_1 \epsilon \le \inf_{0 < h < H} P(\{x : |p_h(x) - \lambda| < \epsilon\}) \le \sup_{0 < h < H} P(\{x : |p_h(x) - \lambda| < \epsilon\}) \le \kappa_2 \epsilon.$$

(B3) For a given $\alpha$, there exist positive constants $\kappa_3$, $\eta_0$ and $H$ bounded away from 0 and $\infty$, such that, for
all $0 < \eta < \eta_0$,

$$\sup_{0 < h < H} d_\infty(M_h(\alpha), M_h(\alpha + \eta)) \le \kappa_3 |\eta|.$$

Here $M_h(\alpha) = \{u : p_h(u) > \lambda_\alpha\}$.

We remark that condition (B0) follows from (A0) and the compactness of the kernel, while (B1) follows
directly from (A1) (for a formal argument, see the end of the proof of Lemma 3.4). We state them as
assumptions for clarity.

The more stringent conditions (B2) and (B3) are used only for some specific results from Section 4.1
and Section 3.2, respectively. This will be explicitly mentioned in the statement of such results. In particular,
condition (B2) is needed in order to explicitly state the behavior of the instability measure we define
below. We believe this assumption follows from Condition (A2) on the true density $p$ and the use of kernels with
compact support; however, for technical convenience we state it as an assumption. This assumption holds
for all density levels that are not too close to a local maximum or minimum of the density. Assumption (B3)
characterizes the regularity of the level sets of $p_h$ and essentially states that the boundary of these level sets
is well-behaved and not space-filling.
Our analysis depends crucially on the quantity $\|\hat{p}_{h,X} - p_h\|_\infty = \sup_{u \in \mathbb{R}^d} |\hat{p}_{h,X}(u) - p_h(u)|$, for which
we use a probabilistic upper bound that follows from the arguments in Giné and Guillou (2002), under the
following assumption on the kernel $K$.

(VC) The class of functions

$$\mathcal{F} = \left\{ K\left(\frac{x - \cdot}{h}\right),\ x \in \mathbb{R}^d,\ h > 0 \right\}$$

satisfies, for some positive numbers $A$ and $v$,

$$\sup_P N(\mathcal{F}, L_2(P), \epsilon \|F\|_{L_2(P)}) \le \left(\frac{A}{\epsilon}\right)^v,$$

where $N(T; d; \epsilon)$ denotes the $\epsilon$-covering number of the metric space $(T, d)$, $F$ is the envelope function
of $\mathcal{F}$ and the supremum is taken over the set of all probability measures $P$ on $\mathbb{R}^d$. The quantities $A$
and $v$ are called the VC characteristics of $\mathcal{F}$.

Assumption (VC) holds for a large class of kernels, including any compactly supported polynomial kernel
and the Gaussian kernel. The lemma below follows from Giné and Guillou (2002) (see also Rinaldo and
Wasserman, 2010).

Lemma 2.1. Assume that the kernel satisfies the VC property, and that

$$\sup_{t \in \mathbb{R}^d} \sup_{h > 0} \int_{\mathbb{R}^d} K_h(t - x)\, dP(x) < B < \infty.$$ (7)

There exist positive constants $K_1$, $K_2$ and $C$, which depend on $B$ and the VC characteristics of $K$, such that the
following hold:

1. For every $\epsilon > 0$ and $h > 0$, there exists $n(\epsilon, h)$ such that, for all $n \ge n(\epsilon, h)$,

$$\mathbb{P}_X\left(\|\hat{p}_{h,X} - p_h\|_\infty > \epsilon\right) \le K_1 e^{-K_2 n \epsilon^2 h^d}.$$ (8)

2. Let $h_n \to 0$ as $n \to \infty$ in such a way that

$$\frac{n h_n^d}{\log n} \to \infty.$$ (9)

Then, there exist a constant $K_3$ and a number $n_0$ such that, setting $\epsilon_n = \sqrt{\frac{K_3 \log n}{n h_n^d}}$,

$$\mathbb{P}_X\left(\|\hat{p}_{h_n,X} - p_{h_n}\|_\infty > \epsilon_n\right) \le \frac{1}{n}$$ (10)

for all $n \ge n_0$.

The numbers $n(\epsilon, h)$ and $n_0$ depend also on the VC characteristics of $K$ and on $B$. Furthermore, $n(\epsilon, h)$ is
decreasing in both $\epsilon$ and $h$.

This result requires virtually no assumptions on $p$ and only minimal assumptions about the kernel, which
are satisfied for all the usual kernels.

The constraint in equation (9), which in general cannot be dispensed with, has a subtle but important
implication for our later results on instability. In fact, it implies that the bandwidth parameter $h_n$ is only
allowed to vanish at a slower rate than $(\log n / n)^{1/d}$. As a result, our measures of instability defined in Sections
4.1 and 3.2 can be reliably estimated for values of the bandwidth $h \gg (\log n / n)^{1/d}$. Indeed, the threshold
value $(\log n / n)^{1/d}$ is of the same order of magnitude as the minimal spacing among the points in a sample of
size $n$ from $P$. See Deheuvels et al. (1988) and, in particular, Lemma 4.3 below.
3 Estimating the level set and cluster tree

For a given density level $\lambda$ and kernel bandwidth $h$, the estimated level set is $\hat{L}_{h,X}(\lambda) = \{x : \hat{p}_{h,X}(x) > \lambda\}$.
The clusters (connected components) of $\hat{L}_{h,X}(\lambda)$ are denoted by $\hat{\mathcal{C}}_{h,\lambda}$ and the estimated cluster tree is

$$\hat{T}_h = \bigcup_{\lambda > 0} \hat{\mathcal{C}}_{h,\lambda}.$$ (11)

3.1 Fixed $\lambda$

We measure the quality of $\hat{L}_{h,X}(\lambda)$ as an estimator of $L(\lambda)$ using the following loss function:

$$\mathcal{L}(h, X, \lambda) = \int_{L(\lambda) \Delta \hat{L}_{h,X}(\lambda)} p(u)\, du$$ (12)

where we recall that $\Delta$ denotes symmetric set difference. The performance of plug-in estimators of density
level sets has been studied earlier, but we state the results here in a form that provides insights into the
performance of instability measures proposed in the next section.

Theorem 3.1. Assume that the density $p$ satisfies the conditions (A0) and (A1) and that the kernel $K$ satisfies
$\int K(z)\, dz = 1$ and $\int \|z\| K(z)\, dz < D$. For any sequence $h_n = \omega((\log n / n)^{1/d})$, let

$$\epsilon_n = \sqrt{\frac{K_3 \log n}{n h_n^d}}$$

and

$$r_{h_n, \epsilon_n, \lambda} = P(\{u : |p(u) - \lambda| < A D h_n + \epsilon_n\}).$$

Then, for all large $n$,

$$\mathbb{P}_X\left(\mathcal{L}(h_n, X, \lambda) \le r_{h_n, \epsilon_n, \lambda}\right) \ge 1 - \frac{1}{n}.$$

If the assumption (A2) holds for density level $\lambda$, then for all large $n$,

$$\mathbb{P}_X\left(\mathcal{L}(h_n, X, \lambda) \le \kappa_2 (A D h_n + \epsilon_n)\right) \ge 1 - \frac{1}{n}.$$

The following corollary characterizes the optimal scaling of the bandwidth parameter $h_n$ that balances
the approximation and estimation errors.

Corollary 3.2. The value of $h$ that minimizes the bound on $\mathcal{L}$ is

$$h_n^* = c \left(\frac{\log n}{n}\right)^{\frac{1}{d+2}}$$ (13)

where $c > 0$ is an appropriate constant.
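
As a hedged illustration of where the exponent $1/(d+2)$ comes from, the following calculation minimizes the bound $\kappa_2 (A D h + \epsilon_n)$ of Theorem 3.1 over $h$, treating $A$, $D$ and $K_3$ as fixed constants and dropping the factor $\kappa_2$. The explicit constant obtained is specific to this simplified calculation and is an assumption of the sketch, not a value stated in the corollary.

```latex
\begin{align*}
f(h) &= A D h + \sqrt{\frac{K_3 \log n}{n}}\; h^{-d/2}, \\
f'(h) &= A D - \frac{d}{2} \sqrt{\frac{K_3 \log n}{n}}\; h^{-(d+2)/2} = 0
\quad\Longrightarrow\quad
h^{(d+2)/2} = \frac{d}{2 A D} \sqrt{\frac{K_3 \log n}{n}}, \\
h_n^* &= \left( \frac{d \sqrt{K_3}}{2 A D} \right)^{\frac{2}{d+2}}
         \left( \frac{\log n}{n} \right)^{\frac{1}{d+2}}.
\end{align*}
```

Since $f$ is convex on $(0, \infty)$, the stationary point is the minimizer; the first term plays the role of the approximation error and the second that of the estimation error.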



3.2 Fixed $\alpha$

Often it is more natural to define the high-density clusters or level sets by the probability mass contained
in the high-density region, instead of the density level. The level set estimator indexed by the probability
content $\alpha \in (0, 1)$ is given as

$$\hat{M}_{h,X}(\alpha) = \hat{L}_{h,X}(\hat{\lambda}_{h,\alpha,X})$$

where

$$\hat{\lambda}_{h,\alpha,X} = \sup\left\{\lambda : \hat{P}_X(\{u : \hat{p}_{h,X}(u) > \lambda\}) > \alpha\right\},$$ (14)

and $\hat{p}_{h,X}$ is the kernel density estimate computed using the data $X$ with bandwidth $h$. This estimator was studied
by Cadre et al. (2009), though using different techniques and in different settings than ours.
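
Because $\hat{P}_X$ puts mass $1/n$ on each sample point, the supremum in (14) is attained at an order statistic of the density estimate evaluated at the sample points. The Python sketch below makes this explicit. It is illustrative only: the function name is hypothetical, the input is assumed to be the list of values $\hat{p}_{h,X}(X_i)$ computed beforehand, and the strict inequality follows (14) as written above.

```python
import math


def empirical_level(density_at_sample, alpha):
    """sup{lam : (1/n) * #{i : p_i > lam} > alpha} for the values p_i = p_hat(X_i).

    At least k = floor(n * alpha) + 1 values must exceed lam, which holds exactly
    when lam is below the k-th largest value; the supremum is that value.
    """
    n = len(density_at_sample)
    k = math.floor(n * alpha) + 1  # smallest count strictly exceeding n * alpha
    return sorted(density_at_sample, reverse=True)[k - 1]


values = [0.1, 0.4, 0.3, 0.2]  # hypothetical density estimates at four sample points
print(empirical_level(values, 0.5))
```

Products such as `n * alpha` are computed in floating point, so when $n\alpha$ is meant to be an integer a tolerance may be needed in practice; the sketch ignores this.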
Let $\alpha \in (0, 1)$ be fixed and define

$$\lambda_{h,\alpha} = \sup\{\lambda : P(p_h(X) > \lambda) > \alpha\}.$$

We first show that the deviation $|\lambda_{h,\alpha} - \lambda_\alpha|$ is of order $h$, uniformly over $\alpha$, under the very general assumption
that the true density $p$ is Lipschitz.

Lemma 3.3. Assume the true density $p$ satisfies the conditions (A0) and (A1). Then, for any $h > 0$,

$$\sup_{\alpha \in (0,1)} |\lambda_{h,\alpha} - \lambda_\alpha| \le A D h,$$ (15)

where $D = \int_{\mathbb{R}^d} \|z\| K(z)\, dz$.

Remark: More generally, if $p$ is assumed to be Hölder continuous with parameter $\beta$ then, under additional
mild integrability conditions on $K$, it can be shown that $|\lambda_{h,\alpha} - \lambda_\alpha| = O(h^\beta)$, uniformly in $\alpha$.

The following lemma bounds the deviation $|\hat{\lambda}_{h,\alpha,X} - \lambda_{h,\alpha}|$.

Lemma 3.4. Assume that the true density satisfies (A0)-(A1) and the density level sets of $p_h$ corresponding to
probability content $\alpha$ satisfy (B3). Then, for any $0 < h < H$, any $\epsilon < \eta_0 - 1/n$, and for all large $n$,

$$\mathbb{P}_X\left(|\hat{\lambda}_{h,\alpha,X} - \lambda_{h,\alpha}| > \epsilon (A \kappa_3 + 1) + A \kappa_3 / n\right) \le K_1 e^{-K_2 n h^d \epsilon^2} + 8 n e^{-n \epsilon^2 / 32},$$ (16)

where $A$ is the Lipschitz constant and $\kappa_3$ is the constant in (B3).

Using Lemma 3.3 and Lemma 3.4, we immediately obtain the following bound on the deviation of the
estimated level $\hat{\lambda}_{h,\alpha,X}$ from the true density level $\lambda_\alpha$ corresponding to probability content $\alpha$.

Corollary 3.5. Under the same conditions of Lemma 3.4,

$$\mathbb{P}_X\left(|\hat{\lambda}_{h,\alpha,X} - \lambda_\alpha| > A D h + \epsilon (A \kappa_3 + 1) + A \kappa_3 / n\right) \le K_1 e^{-K_2 n h^d \epsilon^2} + 8 n e^{-n \epsilon^2 / 32}.$$

We now study the performance of the level set estimator indexed by probability content using the following
loss function

$$\mathcal{L}(h, X, \alpha) = P(M(\alpha) \Delta \hat{M}_{h,X}(\alpha)) = \int_{M(\alpha) \Delta \hat{M}_{h,X}(\alpha)} p(u)\, du.$$

Theorem 3.6. Assume that the density $p$ satisfies the conditions (A0) and (A1) and the level set of
$p_h$ indexed by probability content $\alpha$ satisfies (B3). For any sequence $h_n = \omega((\log n / n)^{1/d})$, let

$$\epsilon_n = \sqrt{\frac{K_3 \log n}{n h_n^d}}.$$

Set

$$C_{1,n} = A D h_n + \epsilon_n, \qquad C_{2,n} = A D h_n + (A \kappa_3 + 1) \epsilon_n + A \kappa_3 / n,$$

and

$$r_{h_n, \epsilon_n, \alpha} = P(\{u : |p(u) - \lambda_\alpha| < C_{1,n} + C_{2,n}\}).$$

Then, for $h_n = \omega((\log n / n)^{1/d})$ and $h_n < H$, we have for all large $n$,

$$\mathbb{P}_X\left(\mathcal{L}(h_n, X, \alpha) \le r_{h_n, \epsilon_n, \alpha}\right) \ge 1 - \frac{2}{n}.$$

In particular, if the assumption (A2) also holds for density level $\lambda_\alpha$, then for all large $n$,

$$\mathbb{P}_X\left(\mathcal{L}(h_n, X, \alpha) \le \kappa_2 (C_{1,n} + C_{2,n})\right) \ge 1 - \frac{2}{n}.$$

Corollary 3.7. The value of $h$ that minimizes the upper bound on $\mathcal{L}$ is

$$h_n^* = c \left(\frac{\log n}{n}\right)^{\frac{1}{d+2}}$$ (17)

where $c > 0$ is a constant.



4 Stability 



The loss $\mathcal{L}$ is a useful theoretical measure of clustering accuracy. Balancing the terms in the upper bound
on the loss gives an indication of the optimal scaling behavior of $h$. But estimating the loss is difficult and
the value of the constant $c$ in the expression for $h_n^*$ is unknown. Thus, in practice, we need an alternate
method to determine $h$. Instead of minimizing the loss, we consider using the stability of $\hat{L}_{h,X}(\lambda)$ and $\hat{T}_h$ to
choose $h$. As we discussed in the introduction, stability ideas have been used for clustering before. But the
behavior of stability measures can be quite complicated. For example, in the context of k-means clustering
and related methods, Ben-David et al. (2006) showed that minimizing instability leads to poor clustering.
On the other hand, Rinaldo and Wasserman (2010) showed that, for density-based clustering, stability-based
methods can sometimes lead to good results. This motivates us to take a deeper look at stability for density
clustering.

In this section, we investigate two measures of instability, which we denote by $\Xi_n(h)$ and $\Gamma_n(h)$. The
measure $\Xi_n(h)$ is the instability of a fixed level set, as a function of $h$. We will see that $\Xi_n$ has surprisingly
complex behavior; see Figure 2. First of all, $\Xi_n(0) = 0$. This is an artifact and is due to the fact that the
level sets get small as $h \to 0$. As $h$ increases, $\Xi_n(h)$ first increases and then gets smaller. Once it gets small
enough, the level sets have become stable and we have reached a good value of $h$. However, after this point,
$\Xi_n(h)$ continues to rise and fall. The reason is that, as $h$ gets larger, $p_h(x)$ decreases. Every time we reach a
value of $h$ such that a mode of $p_h$ has height $\lambda$, $\Xi_n(h)$ will increase. $\Xi_n(h)$ is thus a non-monotonic function
whose mean and variance become large at particular values of $h$. This behavior will be made explicit in the
theory and simulations that follow. As a practical matter, we can exclude all values of $h$ before the first local
maximum of $\Xi_n(h)$. Then, a reasonable choice of $h$ is the smallest value for which $\Xi_n(h)$ is less than some
pre-specified level $\beta$.

The second measure $\Gamma_n(h)$ is a more global measure of instability. When $\Gamma_n(h)$ is small, the whole
cluster tree is stable. It turns out that the behavior of $\Gamma_n(h)$ is much simpler: it is monotonically decreasing
as a function of $h$. In this case we can choose $h$ to be the smallest $h$ for which $\Gamma_n(h) < \beta$.
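
The two selection rules just described can be sketched as follows, assuming the instability has already been evaluated on an increasing grid of bandwidths. The function names are hypothetical, and the way the first local maximum is detected on a discrete grid (a strict rise followed by a non-increase) is an assumption of the sketch rather than a prescription from the text.

```python
def first_local_max_index(values):
    """Index of the first interior grid point that follows a strict rise and is not
    followed by a further rise; None if there is no such point."""
    for i in range(1, len(values) - 1):
        if values[i - 1] < values[i] >= values[i + 1]:
            return i
    return None


def select_level_set_bandwidth(hs, xi, beta):
    """Smallest h at or after the first local maximum of xi with xi(h) < beta."""
    start = first_local_max_index(xi)
    if start is None:
        start = 0  # assumed fallback: nothing is excluded when no local maximum is found
    for h, value in zip(hs[start:], xi[start:]):
        if value < beta:
            return h
    return None


def select_tree_bandwidth(hs, gamma, beta):
    """Smallest h with gamma(h) < beta."""
    for h, value in zip(hs, gamma):
        if value < beta:
            return h
    return None


# Toy values, made up for illustration only.
hs = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
xi = [0.0, 0.2, 0.3, 0.1, 0.04, 0.2]
gamma = [0.9, 0.5, 0.2, 0.04, 0.01, 0.0]
print(select_level_set_bandwidth(hs, xi, beta=0.05))
print(select_tree_bandwidth(hs, gamma, beta=0.05))
```

In the toy example the smallest bandwidth has zero instability, which is the artifact near $h = 0$; excluding the grid points before the first local maximum prevents the rule from selecting it.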

The motivation for this choice of $h$ is the following. We cannot estimate loss exactly. But we can use the
instability to estimate variability. Our choice of $h$ corresponds to making the bias as small as possible while
maintaining control over the variability. This is very much in the spirit of the Neyman-Pearson approach
to hypothesis testing where one tries to make the power of a test as large as possible while controlling the
probability of false positives. Put another way, $P_h = P \otimes \mathbb{K}_h$ has a blurred version of the shape information
in $P$. We are choosing the smallest $h$ such that the shape information in $P_h$ can be reliably recovered.

Before getting into the details, which turn out to be somewhat technical, here is a very loose description
of the results. For large $h$, $\Gamma_n(h) \approx 1/\sqrt{n h^d}$. On the other hand, $\Xi_n(h)$ tends to oscillate up and down
corresponding to the presence of modes of the density. In regions where it is small, it also behaves like
$1/\sqrt{n h^d}$.



4.1 Level Set Stability 

In this section we focus on a single level set indexed by the density level $\lambda$. Fix some $\lambda > 0$. Consider two
independent samples $X = (X_1, \ldots, X_n)$ and $Y = (Y_1, \ldots, Y_n)$. Let

$$\xi(h) = \mathbb{E}_{X,Y}\left(P(\hat{L}_{h,X}(\lambda) \Delta \hat{L}_{h,Y}(\lambda))\right).$$ (18)

Thus, $\xi(h)$ measures the disagreement between level sets based on two samples.

The definition of $\xi$ depends on $P$ which, of course, we do not know. To estimate $\xi(h)$ we proceed as
follows. For simplicity, assume that the sample size is $3n$. We randomly split the data into three pieces
$(X, Y, Z)$ each of size $n$. Let $\hat{p}_{h,X}$ be the density estimator constructed from $X$ and $\hat{p}_{h,Y}$ be the density
estimator constructed from $Y$. Let $\hat{P}_Z$ denote the empirical distribution of $Z$. The sample instability statistic
is

$$\Xi_n(h) = \hat{P}_Z(\hat{L}_{h,X}(\lambda) \Delta \hat{L}_{h,Y}(\lambda)),$$ (19)

and its expectation is

$$\xi(h) = \mathbb{E}_{X,Y,Z}[\Xi_n(h)].$$

Note that since we are using the empirical distribution $\hat{P}_Z$, the sample instability can be rewritten as

$$\Xi_n(h) = \frac{1}{n} \sum_{i=1}^{n} I\left(Z_i \in \hat{L}_{h,X}(\lambda) \Delta \hat{L}_{h,Y}(\lambda)\right)$$ (20)

$$= \frac{1}{n} \sum_{i=1}^{n} I\left(\mathrm{sign}(\hat{p}_{h,X}(Z_i) - \lambda) \ne \mathrm{sign}(\hat{p}_{h,Y}(Z_i) - \lambda)\right).$$ (21)

For a fixed $\lambda$, we count the fraction of the observations in $Z$ where $\hat{p}_{h,X}(Z_i) < \lambda < \hat{p}_{h,Y}(Z_i)$ or $\hat{p}_{h,X}(Z_i) >
\lambda > \hat{p}_{h,Y}(Z_i)$. This representation is closely tied to the use of the sample level sets to construct the cluster tree
(Stuetzle and Nugent (2009)) where each level set is represented only by the observations associated with
its connected components rather than the feature space. Using the empirical distribution $\hat{P}_Z$ also removes
the need to determine the exact shape of the density estimate's level sets. The top graph of Figure 2 shows
the sample instability as a function of $h$ for $\lambda = 0.09$ for our example distribution. Note that the instability
initially drops and then oscillates before dropping to zero at $h = 7.08$, indicating the multi-modality seen in
Figure 1. More discussion of this example is in Section 5.
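
The data-splitting statistic in (19)-(21) is straightforward to compute. The Python sketch below is a minimal univariate illustration and not the code behind Figure 2: the Epanechnikov kernel, the sample size, the random seed and all function names are assumptions made for the example.

```python
import random


def kde_1d(u, sample, h):
    """Kernel density estimate at u using the Epanechnikov kernel (an assumed choice)."""
    total = 0.0
    for x in sample:
        t = abs(u - x) / h
        if t <= 1.0:
            total += 0.75 * (1.0 - t * t)
    return total / (len(sample) * h)


def sample_instability(X, Y, Z, h, lam):
    """Fraction of points of Z lying in exactly one of the two estimated level sets."""
    disagreements = 0
    for z in Z:
        in_x = kde_1d(z, X, h) > lam  # is z in the level set estimated from X?
        in_y = kde_1d(z, Y, h) > lam  # is z in the level set estimated from Y?
        if in_x != in_y:
            disagreements += 1
    return disagreements / len(Z)


def draw_mixture(n, rng):
    """Sample from (4/7) N(0, 1) + (2/7) N(3.5, 1) + (1/7) N(7, 1)."""
    means = (0.0, 3.5, 7.0)
    weights = (4, 2, 1)
    return [rng.gauss(rng.choices(means, weights=weights)[0], 1.0) for _ in range(n)]


rng = random.Random(0)
n = 200
data = draw_mixture(3 * n, rng)
rng.shuffle(data)  # random split into three pieces of size n
X, Y, Z = data[:n], data[n:2 * n], data[2 * n:]
for h in (0.05, 0.5, 1.0, 2.0, 8.0):
    print(h, sample_instability(X, Y, Z, h, lam=0.09))
```

Evaluating the statistic on a fine grid of bandwidths gives a curve of the kind shown in the top panel of Figure 2; the exact values depend on the kernel, the split and the seed.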

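We now present the following simple but important boundary properties of $\Xi_n$ and $\xi$. The proof is
straightforward and is omitted.

Lemma 4.1. For fixed $n$, $\lim_{h \to 0} \xi(h) = \lim_{h \to \infty} \xi(h) = 0$, and $\lim_{h \to 0} \Xi_n(h) = \lim_{h \to \infty} \Xi_n(h) = 0$ a.s.

We now study the behavior of the mean function $\xi(h)$. Let $u \in \mathbb{R}^d$, $h > 0$ and $\epsilon > 0$, and define

$$\pi_h(u) = \mathbb{P}_X(\hat{p}_{h,X}(u) > \lambda) \quad \text{and} \quad U_{h,\epsilon} = \{u : |p_h(u) - \lambda| < \epsilon\}.$$ (22)

Theorem 4.2. Let $u \in \mathbb{R}^d$, $h > 0$ and $\epsilon > 0$.

1. The following identity holds:

$$\xi(h) = 2 \int \pi_h(u)(1 - \pi_h(u))\, dP(u).$$

2. For all large $n$,

$$r_{h,\epsilon}\, \underline{A}_{h,\epsilon} \le \xi(h) \le r_{h,\epsilon}\, \overline{A}_{h,\epsilon} + 2 K_1 e^{-K_2 n h^d \epsilon^2},$$ (23)

where $r_{h,\epsilon} = P(U_{h,\epsilon})$,

$$\overline{A}_{h,\epsilon} = \sup_{u \in U_{h,\epsilon}} 2 \pi_h(u)(1 - \pi_h(u))$$

and

$$\underline{A}_{h,\epsilon} = \inf_{u \in U_{h,\epsilon}} 2 \pi_h(u)(1 - \pi_h(u)).$$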
Figure 3: Top plots and left bottom plot: two densities $p_h$ corresponding to the mixture distribution of Figure
1 for $h = 0$, i.e. the true density (in black), and $h = 4.5$ (in red); the horizontal lines indicate the level set
value of $\lambda = 0.09$, $\lambda + \epsilon$ and $\lambda - \epsilon$, for $\epsilon$ equal to 0.02, 0.05 and 0.1. Right bottom plot: probability content
values $r_{h,\epsilon}$ as a function of $h \in [0, 4.5]$ for the three values of $\epsilon$.

Part 2 of the previous theorem implies that the behavior of $\xi$ is essentially captured by the behavior of
the probability content $r_{h,\epsilon}$. This quantity is, in general, a complicated function of both $h$ and $\epsilon$. While it is
easy to see that, for fixed $h$ and a sufficiently well-behaved density $p$, $r_{h,\epsilon} \to 0$ as $\epsilon \to 0$, for fixed $\epsilon$, $r_{h,\epsilon}$ can
instead be a non-monotonic function of $h$. See, for example, the bottom right plot in Figure 3, which displays
the values $r_{h,\epsilon}$ as a function of $h \in [0, 4.5]$ and for $\epsilon$ equal to 0.02, 0.05 and 0.1 for the mixture density of
Figure 1. In particular, the fluctuations of $r_{h,\epsilon}$ as a function of $h$ are related to the values of $h$ for which the
critical points of $p_h$ are in the interval $[\lambda - \epsilon, \lambda + \epsilon]$. The main point to notice is that $r_{h,\epsilon}$ is a complicated,
non-monotonic function of $h$. This explains why $\Xi_n(h)$ is non-monotonic in $h$.

As mentioned at the end of section 2.3, for values of $h \ll (\log n / n)^{1/d}$, smaller than the minimal spacing
among the sample points, the kernel density estimate $\hat{p}_{h,X}$ is no longer a reliable estimate of $p_h$. We describe
this effect on the expected instability in our next result.

Lemma 4.3. Fix $\lambda > 0$. Then, for any fixed $n$ large enough, $\xi(h) = O(h^d)$ as $h \to 0$.
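We now provide an upper and lower bound on the values of $\overline{A}_{h,\epsilon}$ and $\underline{A}_{h,\epsilon}$, respectively, under the
simplifying assumption that $K$ is the spherical kernel. Notice that, while $\overline{A}_{h,\epsilon}$ remains bounded away from
$\infty$ for any sequence $\epsilon_n \to 0$ and $h_n = \omega(n^{-1/d})$, the same is not true for $\underline{A}_{h,\epsilon}$, which remains bounded away
from 0 as long as $\epsilon_n = \Theta(1/\sqrt{n h_n^d})$ and $h_n = \omega(n^{-1/d})$.

Lemma 4.4. Assume that $K$ is the spherical kernel and let $0 < \epsilon < \lambda/2$. For a given $\delta \in (0, 1)$, let

$$h(\delta, \epsilon) = \sup\left\{h : \sup_{u \in U_{h,\epsilon}} P(B(u, h)) < 1 - \delta\right\}.$$ (24)

Then, for all $h < h(\delta, \epsilon)$,

$$\overline{A}_{h,\epsilon} \le 2 \left(1 - \Phi\left(-\sqrt{n h^d}\, \epsilon \sqrt{\frac{2 v_d}{3 \lambda}}\right) + \frac{C(\delta, \lambda)}{\sqrt{n h^d}}\right)^2$$

and

$$\underline{A}_{h,\epsilon} \ge 2 \left(1 - \Phi\left(\sqrt{n h^d}\, \epsilon \sqrt{\frac{2 v_d}{\delta \lambda}}\right) - \frac{C(\delta, \lambda)}{\sqrt{n h^d}}\right)^2,$$

where $\Phi$ denotes the cumulative distribution function of a standard normal random variable and

$$C(\delta, \lambda) = \frac{33}{4} \sqrt{\frac{2}{\delta v_d \lambda}}.$$

The dips in Figure 2 correspond to values for which $p_h$ does not have a mode at height $\lambda$. In this case, (B2)
holds and we have $r_{h,\epsilon} = O(\epsilon)$. Now choosing $\epsilon \approx \sqrt{\log n / (n h^d)}$ for the upper bound and $\epsilon \approx \sqrt{1/(n h^d)}$ for
the lower bound, we have that $\overline{A}_{h,\epsilon}$ and $\underline{A}_{h,\epsilon}$ are bounded, and the theorem yields bounds on $\xi(h)$ of order
$1/\sqrt{n h^d}$, up to a logarithmic factor.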

Next we investigate the extent to which $\Xi_n(h)$ is concentrated around its mean $\xi(h) = \mathbb{E}[\Xi_n(h)]$. We first
point out that, for any fixed $h$, the variance of the instability can be bounded by $\xi(h)(1/2 - \xi(h))$.

Lemma 4.5. For any $h > 0$,

$$\mathrm{Var}[\Xi_n(h)] \le \xi(h)\left(\frac{1}{2} - \xi(h)\right).$$

The previous result highlights the interesting feature that the empirical instability will be less variable
around the values of $h$ for which the expected instability is very small (close to 0) or very large (close to
1/2).

Lemma 4.6. For any $h > 0$, $\epsilon > 0$, $\eta \in (0, 1)$ let $t$ be such that

$$t(1 - \eta) \ge r_{h,\epsilon} + 2 K_1 e^{-K_2 n h^d \epsilon^2},$$ (26)

where $r_{h,\epsilon} = P(U_{h,\epsilon})$. Then, for all large $n$,

$$\mathbb{P}_{X,Y,Z}\left(|\Xi_n(h) - \xi(h)| > t\right) \le e^{-n t C_\eta} + 2 K_1 e^{-n K_2 h^d \epsilon^2}$$ (27)
where 
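4.2 Stability of level sets indexed by probability content

As in the fixed-$\lambda$ case, we assume for simplicity that the sample has size $3n$ and split it equally in three parts:
$X$, $Y$ and $Z$. We now define the fixed-$\alpha$ instability as

$$\Xi_n(h, \alpha) = \hat{P}_Z(\hat{M}_{h,X}(\alpha) \Delta \hat{M}_{h,Y}(\alpha)),$$

where

$$\hat{M}_{h,X}(\alpha) = \{x : \hat{p}_{h,X}(x) > \hat{\lambda}_{h,\alpha,X}\},$$ (28)

with $\hat{\lambda}_{h,\alpha,X}$ estimated as in (14) using the points in $X$; we similarly estimate $\hat{M}_{h,Y}(\alpha)$. As before, $\hat{P}_Z$ denotes
the empirical measure arising from $Z$. Again, we use the observations to represent $\hat{M}_{h,X}$ and $\hat{M}_{h,Y}$ as done for
$\Xi_n(h)$ for a fixed $\lambda$. Examples of $\Xi_n(h, \alpha)$ as a function of $h$ and $\alpha$ can be seen in Section 5.
The expected instability is

$$\xi(h, \alpha) = \mathbb{E}_{X,Y,Z}[\Xi_n(h, \alpha)].$$

We begin by studying the behavior of the expected instability.

Theorem 4.7. Let $u \in \mathbb{R}^d$, $h > 0$ and $\epsilon > 0$, and set

$$\pi_{h,\alpha}(u) = \mathbb{P}_X(\hat{p}_{h,X}(u) > \hat{\lambda}_{h,\alpha,X}) \quad \text{and} \quad U_{h,2\tilde{\epsilon},\alpha} = \{u : |p_h(u) - \lambda_{h,\alpha}| < 2\tilde{\epsilon}\}.$$

1. The expected instability can be expressed as

$$\xi(h, \alpha) = \mathbb{E}_{X,Y,Z}[\Xi_n(h, \alpha)] = 2 \int \pi_{h,\alpha}(u)(1 - \pi_{h,\alpha}(u))\, dP(u).$$

2. Let $\epsilon < \eta_0 - 1/n$ and $\tilde{\epsilon} = \epsilon (A \kappa_3 + 1) + A \kappa_3 / n$. Then, for all large $n$,

$$P(U_{h,2\tilde{\epsilon},\alpha})\, \underline{A}_{h,\tilde{\epsilon},\alpha} \le \xi(h, \alpha) \le P(U_{h,2\tilde{\epsilon},\alpha})\, \overline{A}_{h,\tilde{\epsilon},\alpha} + 4 K_1 e^{-K_2 n h^d \epsilon^2} + 16 n e^{-n \epsilon^2 / 32},$$

where

$$\overline{A}_{h,\tilde{\epsilon},\alpha} = \sup_{u \in U_{h,2\tilde{\epsilon},\alpha}} 2 \pi_{h,\alpha}(u)(1 - \pi_{h,\alpha}(u))$$

and

$$\underline{A}_{h,\tilde{\epsilon},\alpha} = \inf_{u \in U_{h,2\tilde{\epsilon},\alpha}} 2 \pi_{h,\alpha}(u)(1 - \pi_{h,\alpha}(u)).$$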



3. Assume in addition that K is the spherical kernel and that e < inf/, For a given 5 e (0, 1), let 



h(S,c,a)=swp\h: sup P(B(u, h)) < 1 - d\. (29) 



uGU h 

Then, for all h < h(S, e, a), 

A h . ea <2(l-*(-3^ 2v ^ ■ 



3^h,a J v nh d 



8^h,aJ \ / nh d 

where $ denote the cumulative distribution function of a standard normal random variable and 



and 



4 V Sv d Xh a 
As for the fluctuations of $\Xi_n(h, \alpha)$ around its mean, we can easily obtain a result similar to the one we
obtained in Lemma 4.6.

Lemma 4.8. For any $h > 0$, $\epsilon > 0$, $\eta \in (0, 1)$ let $\tilde{\epsilon} = \epsilon (A \kappa_3 + 1) + A \kappa_3 / n$ and $t$ be such that
where $r_{h,\tilde{\epsilon},\alpha} = P(\{u : |p_h(u) - \lambda_{h,\alpha}| < 2\tilde{\epsilon}\})$. Then, for all large $n$,

Pxyz (\En(h, a) - £(h, a)\>t)< e - ntc " + AK^^"' 2 + l^ne^ 1/32 < S + 2K X cxp \-nK 2 h d e 2 } . 

3n 

(30) 

with 

The proof is basically the same as the proof of Lemma 4.6, except that we have to restrict our analysis to 
the event in (52). We omit the details. 



4.3 Stability for density cluster trees 

The stability properties of the density tree can be easily derived from the results we have established so far.
To this end, for a fixed $h > 0$, define the level set of $p_h$

$$L_h(\lambda) = \{u : p_h(u) > \lambda\}$$

and recall the level set estimate

$$\hat{L}_{h,X}(\lambda) = \{u : \hat{p}_{h,X}(u) > \lambda\}.$$

Let $N_h(\lambda)$ and $\hat{N}_{h,X}(\lambda)$ be the number of connected components of the sets $L_h(\lambda)$ and $\hat{L}_{h,X}(\lambda)$, respectively.
Notice that $\hat{L}_{h,X}(\lambda)$ is a random set. Also, denote with $C_1, \ldots, C_{N_h(\lambda)}$ and $\hat{C}_1, \ldots, \hat{C}_{\hat{N}_{h,X}(\lambda)}$ the connected
components of $L_h(\lambda)$ and $\hat{L}_{h,X}(\lambda)$, respectively.

When building cluster trees, the value of the bandwidth $h$ is kept fixed and the values of the level $\lambda$
vary instead. It has been observed empirically (see, e.g. Stuetzle and Nugent, 2009) that the uncertainty of
cluster trees depends on the particular value of $\lambda$ at which the tree is observed. In order to characterize the
behavior of the density tree, we propose the following definition.

Definition 4.9. A level set value $\lambda$ is $(h, \epsilon)$-stable, with $\epsilon > 0$ and $h > 0$, if

$$N_h(\lambda) = N_h(\lambda'), \quad \forall \lambda' \in (\lambda - \epsilon, \lambda + \epsilon)$$

and, for any $\lambda - \epsilon < \lambda_1 < \lambda_2 < \lambda + \epsilon$,

$$C_i(\lambda_2) \subset C_i(\lambda_1), \quad \forall i = 1, \ldots, N_h(\lambda).$$

If the level $\lambda$ is $(h, \epsilon)$-stable, then the cluster tree estimate at level $\lambda$ is an accurate estimate of the true
cluster tree, in a sense made precise by the following result, whose proof follows easily from the proofs of
our previous results and Lemma 2 in Rinaldo and Wasserman (2010).

Lemma 4.10. If $\lambda$ is $(h, \epsilon)$-stable, then, for all $n$ large enough, with probability at least $1 - \frac{1}{n}$,

1. $N_h(\lambda) = \hat{N}_{h,X}(\lambda)$;

2. there exists a permutation $\sigma$ on $\{1, \ldots, N_h(\lambda)\}$ such that, for every connected component $C_j$ of $L_h(\lambda - \epsilon)$
there exists one $\hat{C}_{\sigma(j)}$ for which

3. $P(\hat{L}_{h,X}(\lambda) \Delta L_h(\lambda)) \le P(\{u : |p_h(u) - \lambda| < \epsilon\})$.



Remarks. 

1. The values of A which are not (h, e)-stable are the ones for which 

inf ||Vp h («)||=0, 

for some A' € (A — e, A + e). For those values, the probability of Nh(\) ^ Nh,x{X) can be quite large, 
since the set L h ^ x AL h {X) may have a relatively large P-mass. 

2. Conversely, if p h is smooth (which is the case if, for instance, the kernel or p are smooth) and 
inf ue [/ A h e \WPh{ u )\\ > 3, then A is (h, e)-stable for a small enough e. 

The above result has somewhat limited practical value, because the notion of an $(h, \epsilon)$-stable $\lambda$ depends
on the unknown density $p_h$. To get a better sense of which values of $\lambda$ are $(h, \epsilon)$-stable, we once again
evaluate the instability of the clustering solution via data splitting. In fact, essentially all of our
previous results about instability from Section 4.1 carry over to this new setting by treating h as fixed and
letting $\lambda$ vary. To express these changes explicitly, we adopt a slightly different notation for quantities we
have already considered. In particular, we let

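$$U_{\lambda,\epsilon} = \{u : |p_h(u) - \lambda| < \epsilon\}$$

$$r_{\lambda,\epsilon} = P(U_{\lambda,\epsilon})$$

$$\pi_\lambda(u) = P_X(\hat p_{h,X}(u) > \lambda)$$

$$\overline{A}_{\lambda,\epsilon} = \sup_{u \in U_{\lambda,\epsilon}} 2\pi_\lambda(u)(1 - \pi_\lambda(u))$$

$$\underline{A}_{\lambda,\epsilon} = \inf_{u \in U_{\lambda,\epsilon}} 2\pi_\lambda(u)(1 - \pi_\lambda(u)).$$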



We divide the sample into three distinct groups, X, Y and Z, of equal size n. Define the instability
of the density cluster tree as the random function $\Gamma_n : \mathbb{R}_{\ge 0} \to [0, 1]$ given by

$$\lambda \mapsto \hat P_Z(\hat L_{h,X}(\lambda) \Delta \hat L_{h,Y}(\lambda)).$$

Also, let

$$\gamma(\lambda) = E_{X,Y,Z}[\Gamma_n(\lambda)].$$

For any fixed h, the behavior of $\Gamma_n(\lambda)$ and $\gamma(\lambda)$ is essentially governed by $r_{\lambda,\epsilon}$. The following result describes
some of the properties of the density tree instability. We omit its proof, because it relies on essentially the
same arguments as the proofs of the results described in Section 4.1.

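Corollary 4.11.

1. For any $\lambda > 0$, the expected density tree instability can be expressed as

$$\gamma(\lambda) = 2 \int \pi_\lambda(u)(1 - \pi_\lambda(u))\, dP(u).$$

2. For any $\epsilon > 0$ and $\lambda > 0$,

$$\underline{A}_{\lambda,\epsilon}\, r_{\lambda,\epsilon} \le \gamma(\lambda) \le \overline{A}_{\lambda,\epsilon}\, r_{\lambda,\epsilon} + 2K_1 e^{-K_2 n h^d \epsilon^2},$$

for all n large enough.

3. Assume that K is the spherical kernel. For any $\lambda > 0$, let $0 < \epsilon \le \lambda/2$ and let

$$\delta = 1 - \sup_u P(B(u, h)).$$

Then,

$$\overline{A}_{\lambda,\epsilon} \le 2 \left( 1 - \Phi\left( -\sqrt{n h^d}\, \epsilon\, \frac{2 v_d}{3\lambda} \right) + \frac{C(\delta, \lambda)}{\sqrt{n h^d}} \right)^2$$

and

$$\underline{A}_{\lambda,\epsilon} \ge 2 \left( 1 - \Phi\left( \sqrt{n h^d}\, \epsilon\, \frac{2 v_d}{\delta\lambda} \right) - \frac{C(\delta, \lambda)}{\sqrt{n h^d}} \right)^2,$$

where $\Phi$ denotes the cumulative distribution function of a standard normal random variable and

$$C(\delta, \lambda) = \frac{33}{4} \sqrt{\frac{2}{\delta v_d \lambda}}.$$

4. For any $h > 0$, $\epsilon > 0$, $\eta \in (0, 1)$, let $t$ be such that

$$t(1 - \eta) \ge r_{\lambda,\epsilon} + 2K_1 e^{-K_2 n h^d \epsilon^2}. \quad (31)$$

Then, for all n that are large enough,

$$P_{X,Y,Z}\left( |\Gamma_n(\lambda) - \gamma(\lambda)| > t \right) \le e^{-n t C_\eta} + 2K_1 e^{-n K_2 h^d \epsilon^2}, \quad (32)$$

with $C_\eta$ a constant depending only on $\eta$.

4.4 Total Variation Stability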
4.4 Total Variation Stability 

In the previous section, we established stability of the cluster tree for a fixed h and all levels $\lambda$ that are
$(h, \epsilon)$-stable. To establish stability of the entire cluster tree, we will now consider an even stronger notion of
instability. Let $\mathcal{B}$ denote the collection of all measurable subsets of $\mathbb{R}^d$. Define the total variation instability



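$$\Gamma_n(h) = \sup_{B \in \mathcal{B}} \left| \int_B \hat p_{h,X}(u)\, du - \int_B \hat p_{h,Y}(u)\, du \right| = \frac{1}{2} \int |\hat p_{h,X}(u) - \hat p_{h,Y}(u)|\, du,$$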



where the latter equality is a standard identity. Requiring $\Gamma_n(h)$ to be small is a more demanding type of
stability. In particular, $\mathcal{B}$ includes all level sets for all $\lambda$. Thus, when $\Gamma_n(h)$ is small, the entire cluster tree
is stable. Note that $\Gamma_n(h)$ is easy to interpret: it is the maximum difference in probability between the two
density estimators. And of course $0 \le \Gamma_n(h) \le 1$. The bottom graph in Figure 2 shows the total variation
instability for our example distribution in Figure 1. Note that $\Gamma_n(h)$ first drops drastically as h increases and
then continues to decrease smoothly.

We now discuss the properties of $\Gamma_n(h)$. Note first that $\Gamma_n(h) \approx 1$ for small h, so the behavior as h gets
large is most relevant.

Theorem 4.12. Let $\mathcal{H}_n$ be a finite set of bandwidths such that $|\mathcal{H}_n| = A n^a$, for some positive $A$ and $a \in (0, 1)$.

Fix a $\delta \in (0, 1)$.

1. (Upper bound.) There exists a constant $C$ such that, for all n large enough and such that $\delta > A/n$,

$$P_{X,Y}\left( \Gamma_n(h) \le t_h \text{ for all } h \in \mathcal{H}_n \right) > 1 - \delta,$$



where t h - ■ 



nh d 
2. (Lower bound.) Suppose that K is the spherical kernel and that the probability distribution P satisfies the
conditions

$$a_1 h^d v_d \le \inf_{u \in S} P(B(u, h)) \le \sup_{u \in S} P(B(u, h)) \le h^d v_d a_2, \quad \forall h > 0, \quad (33)$$

for some positive constants $a_1 < a_2$, where $S$ denotes the support of $P$. Let $h^*$ be such that $\sup_u P(B(u, h^*)) < 1 - \delta$.
There exists a $t$, depending on $\delta$ but not on $h$, such that, for all $h < h^*$ and all n large enough,

V X ,Y ^ n (h)>tJ^J>l-6. 

3. $\Gamma_n(0) = 1$ and $\Gamma_n(\infty) = 0$.

Remarks.

1. Note that the upper bound is uniform in h while the lower bound is pointwise in h. Making the lower
bound uniform is an open problem. However, if we place a nonzero lower bound on the bandwidths in
$\mathcal{H}_n$ then the bound could be made uniform. This approach was used in Chaudhuri and Marron (2000).

2. Conditions (33) are quite standard in support set estimation. In particular, when the lower bound 
holds, the support S is said to be standard. See, for instance, Cuevas and Rodriguez-Casal (2004). 

In low dimensions, we can compute $\Gamma_n(h)$ by numerically evaluating the integral

$$\frac{1}{2} \int |\hat p_{h,X}(u) - \hat p_{h,Y}(u)|\, du.$$
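
The one-dimensional case of this numerical evaluation can be sketched as follows. The code is a minimal illustration, assuming an Epanechnikov kernel and a midpoint rule on a grid that covers the support of both estimates; the function names, grid size and toy samples are hypothetical choices rather than anything prescribed by the paper.

```python
def epanechnikov_kde(data, h):
    """Return the map u -> kernel density estimate (Epanechnikov kernel)."""
    n = len(data)

    def p_hat(u):
        total = 0.0
        for x in data:
            t = (u - x) / h
            if abs(t) < 1.0:
                total += 0.75 * (1.0 - t * t)
        return total / (n * h)

    return p_hat


def tv_instability(x_sample, y_sample, h, n_grid=2000):
    """Midpoint-rule approximation of (1/2) * integral |p_X - p_Y| du."""
    p_x = epanechnikov_kde(x_sample, h)
    p_y = epanechnikov_kde(y_sample, h)
    # Both estimates vanish outside [lo, hi] because the kernel has support [-1, 1].
    lo = min(min(x_sample), min(y_sample)) - h
    hi = max(max(x_sample), max(y_sample)) + h
    step = (hi - lo) / n_grid
    total = 0.0
    for k in range(n_grid):
        u = lo + (k + 0.5) * step
        total += abs(p_x(u) - p_y(u))
    return 0.5 * total * step


print(tv_instability([0.0, 0.1], [0.0, 0.1], h=0.5))    # identical samples
print(tv_instability([0.0, 0.1], [10.0, 10.2], h=0.5))  # disjoint supports
```

Identical samples give zero, and samples whose estimates have disjoint supports give a value close to one, matching the range of the total variation instability.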

In high dimensions it may be easier to use importance sampling as follows. Let $g(u) = (1/2)(\hat p_{h,X}(u) + \hat p_{h,Y}(u))$. Then

$$\Gamma_n(h) = \frac{1}{2} \int \frac{|\hat p_{h,X}(u) - \hat p_{h,Y}(u)|}{g(u)}\, g(u)\, du \approx \frac{1}{N} \sum_{i=1}^N \frac{|\hat p_{h,X}(U_i) - \hat p_{h,Y}(U_i)|}{|\hat p_{h,X}(U_i) + \hat p_{h,Y}(U_i)|}$$

where $U_1, \ldots, U_N$ is a random sample from $g$. We can thus estimate $\Gamma_n(h)$ with the following
algorithm:



1. Draw Bernoulli(1/2) random variables $Z_1, \ldots, Z_N$.

2. Draw $U_1, \ldots, U_N$ as follows:

(a) If $Z_i = 1$: draw $X$ randomly from $X_1, \ldots, X_n$. Draw $W \sim K$. Set $U_i = X + hW$.

(b) If $Z_i = 0$: draw $Y$ randomly from $Y_1, \ldots, Y_n$. Draw $W \sim K$. Set $U_i = Y + hW$.

3. Set

$$\hat\Gamma_n(h) = \frac{1}{N} \sum_{i=1}^N \frac{|\hat p_{h,X}(U_i) - \hat p_{h,Y}(U_i)|}{|\hat p_{h,X}(U_i) + \hat p_{h,Y}(U_i)|}.$$

It is easy to see that $U_i$ has density $g$ and that $\hat\Gamma_n(h) - \Gamma_n(h) = O_P(1/\sqrt{N})$, which is negligible for large $N$.
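
The three steps above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: it works in one dimension and swaps in a Gaussian kernel, so that drawing $W \sim K$ reduces to a standard normal draw; the function names and the toy samples are hypothetical.

```python
import math
import random


def gaussian_kde(data, h):
    """Return the map u -> kernel density estimate with a Gaussian kernel (assumed)."""
    n = len(data)
    c = 1.0 / (n * h * math.sqrt(2.0 * math.pi))

    def p_hat(u):
        return c * sum(math.exp(-0.5 * ((u - x) / h) ** 2) for x in data)

    return p_hat


def tv_instability_is(x_sample, y_sample, h, n_draws, rng):
    """Importance-sampling estimate of the total variation instability."""
    p_x = gaussian_kde(x_sample, h)
    p_y = gaussian_kde(y_sample, h)
    total = 0.0
    for _ in range(n_draws):
        # Steps 1 and 2: a fair coin picks the sample; U = point + h * W, W ~ K.
        source = x_sample if rng.random() < 0.5 else y_sample
        u = rng.choice(source) + h * rng.gauss(0.0, 1.0)
        # Step 3: accumulate |p_X - p_Y| / (p_X + p_Y) at the draw.
        a, b = p_x(u), p_y(u)
        total += abs(a - b) / (a + b)
    return total / n_draws


rng = random.Random(1)
same = [0.0, 0.5, 1.0]
far = [100.0, 100.5, 101.0]
print(tv_instability_is(same, same, h=0.3, n_draws=200, rng=rng))
print(tv_instability_is(same, far, h=0.3, n_draws=200, rng=rng))
```

Because each $U_i$ is drawn from the mixture $g$, the denominator is positive at every draw, and the Monte Carlo error shrinks at the usual $1/\sqrt{N}$ rate.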
Figure 4: Comparing $\hat L_{h,X}(0.02)$ and $\hat L_{h,Y}(0.02)$ with h = 0.15 (top left), h = 0.35 (top right), h = 0.75
(bottom left) and h = 0.95 (bottom right) for data sampled from the mixture distribution of Figure 1.
The two kernel density estimates are obtained using the X sample (solid line) and the Y sample (dotted
line). Points in the Z sample are shown as short vertical lines on the x-axis, and are colored in red when
they belong to $\hat L_{h,X}(\lambda) \Delta \hat L_{h,Y}(\lambda)$.

5 Examples 

We present results for two examples where, although the dimensionality is low, estimating the connected
components of the true level sets is surprisingly difficult. For the first example, we begin by illustrating
how the instability changes for given values of $\lambda$ and $\alpha$, and then split each data set 200 times to find pointwise
confidence bands for $\Xi_n(h)$ for fixed $\lambda$, $\alpha$ and for $\Gamma_n(h)$. We then present selected results for a bivariate
example.

5.1 Instability as function of fixed $\lambda$

Returning to the example distribution in Section 1, 600 observations were sampled from the following
mixture of normals: $(4/7)N(0, 1) + (2/7)N(3.5, 1) + (1/7)N(7, 1)$. The original sample is randomly split into
three samples of 200. All kernel density estimates use the Epanechnikov kernel. We examine the stability at
$\lambda = 0.02$, a height at which the true density's connected components should be unambiguous, and $\lambda = 0.09$,
Figure 5: Comparing $\hat L_{h,X}(0.09)$ and $\hat L_{h,Y}(0.09)$ for h = 0.5 (top left), h = 1.75 (top right), h = 3.75 (bottom
left) and h = 6 (bottom right) for data sampled from the mixture distribution of Figure 1. The two
kernel density estimates are obtained using the X sample (solid line) and the Y sample (dotted line). Points
in the Z sample are shown as short vertical lines on the x-axis, and are colored in red when they belong to
$\hat L_{h,X}(\lambda) \Delta \hat L_{h,Y}(\lambda)$.



the height used in our earlier motivating graphs. 

We start by conceptually illustrating the instability for selected values of h in Figures 4 and 5. In each
subfigure, $\hat p_{h,X}$ and $\hat p_{h,Y}$ are graphed for the Z set of observations. The levels $\lambda = 0.02$ and $\lambda = 0.09$ are each marked
with a horizontal line. Those observations in Z that belong to $\hat L_{h,X}(\lambda)$ and not to $\hat L_{h,Y}(\lambda)$ (or vice versa) are
marked in red; the overall fraction of these observations is $\Xi_n(h)$. In general, we can see that as h increases,
the number of red Z observations decreases. For $\lambda = 0.02$, note that the location that most contributes
to the instability is the valley around Z = 5. Once h is large enough to smooth this valley to have height
above $\lambda = 0.02$, the instability is negligible. Turning to $\lambda = 0.09$ (Figure 5), even for larger values of h, the
differences between the two density estimates can be quite large. When h is large enough that both
density estimates lie entirely below $\lambda = 0.09$, our instability drops to and remains at zero.
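
The sample instability at a fixed level can be computed with a short script. The sketch below is hedged: it simulates its own data from the mixture above with an arbitrary seed (so the numbers will not reproduce the figures), uses the Epanechnikov kernel, and relies on hypothetical helper names.

```python
import random


def sample_mixture(n, rng):
    """Draw n points from (4/7)N(0,1) + (2/7)N(3.5,1) + (1/7)N(7,1)."""
    draws = []
    for _ in range(n):
        r = rng.random()
        mean = 0.0 if r < 4 / 7 else (3.5 if r < 6 / 7 else 7.0)
        draws.append(rng.gauss(mean, 1.0))
    return draws


def epanechnikov_kde(data, h):
    """Return the map u -> kernel density estimate (Epanechnikov kernel)."""
    n = len(data)

    def p_hat(u):
        total = 0.0
        for x in data:
            t = (u - x) / h
            if abs(t) < 1.0:
                total += 0.75 * (1.0 - t * t)
        return total / (n * h)

    return p_hat


def sample_instability(x, y, z, h, lam):
    """Fraction of Z points lying in exactly one of the two estimated level sets."""
    p_x, p_y = epanechnikov_kde(x, h), epanechnikov_kde(y, h)
    disagree = sum((p_x(u) > lam) != (p_y(u) > lam) for u in z)
    return disagree / len(z)


rng = random.Random(0)
data = sample_mixture(600, rng)          # i.i.d. draws, so slicing is a random split
x, y, z = data[:200], data[200:400], data[400:]
for h in (0.15, 0.5, 1.0, 3.0, 10.0):
    print(h, sample_instability(x, y, z, h, lam=0.09))
```

With the Epanechnikov kernel the estimate never exceeds $0.75/h$, so for $h = 10$ both estimates lie below $\lambda = 0.09$ and the computed instability is exactly zero, in line with the last remark above.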

Figure 6 shows the overall behavior of $\Xi_n(h)$ as a function of h. As expected, for $\lambda = 0.02$, $\Xi_n(h)$ jumps
for the first non-zero h and then quickly drops to almost zero by h = 1 (Figure 6, left). At $\lambda = 0.09$, a height
with a wide range of possible level sets (depending on the density estimate and the value of h), $\Xi_n(h)$ first
drops and then oscillates as h increases, as previously described, indicating multi-modality (Figure 6, right).
Figure 6: $\Xi_n(h)$ as a function of the bandwidth h for $\lambda = 0.02$ (left) and $\lambda = 0.09$ (right) for data sampled from
the mixture distribution of Figure 1.

5.2 Instability as function of probability content 

In Section 4.2, we defined $\Xi_n(h, \alpha)$, the sample instability as a function of h and $\alpha$. As done before, we
conceptually illustrate $\Xi_n(h, \alpha)$ for selected values of h and $\alpha = 0.50$ and $0.95$ in Figure 7. In each subfigure,
$\hat p_{h,X}$ and $\hat p_{h,Y}$ are again graphed for the Z set of observations. The probability contents of the density estimates
are respectively indicated on the left and right axes. The values $\alpha = 0.50, 0.95$ are also marked with solid
and dashed horizontal lines for the two density estimates. Those observations in Z that belong to $\hat M_{h,X}(\alpha)$
and not to $\hat M_{h,Y}(\alpha)$ (or vice versa) are marked in red; the overall fraction of these observations is $\Xi_n(h, \alpha)$.
In general, we can see that as h increases (for both values of $\alpha$), the number of red Z observations decreases.
This decrease happens more quickly for higher values of $\alpha$ (as expected).
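
A sketch of the same computation indexed by probability content is given below. It rests on an assumption that should be checked against Section 4.2: the level attached to content $\alpha$ is taken to be an empirical quantile of the estimated density evaluated at the sample's own points, so that roughly a fraction $\alpha$ of those points lies strictly above it. The helper names, seed and simulated data are hypothetical.

```python
import random


def sample_mixture(n, rng):
    """Draw n points from (4/7)N(0,1) + (2/7)N(3.5,1) + (1/7)N(7,1)."""
    draws = []
    for _ in range(n):
        r = rng.random()
        mean = 0.0 if r < 4 / 7 else (3.5 if r < 6 / 7 else 7.0)
        draws.append(rng.gauss(mean, 1.0))
    return draws


def epanechnikov_kde(data, h):
    """Return the map u -> kernel density estimate (Epanechnikov kernel)."""
    n = len(data)

    def p_hat(u):
        total = 0.0
        for x in data:
            t = (u - x) / h
            if abs(t) < 1.0:
                total += 0.75 * (1.0 - t * t)
        return total / (n * h)

    return p_hat


def level_for_content(p_hat, sample, alpha):
    """Level such that about a fraction alpha of the sample has p_hat above it."""
    values = sorted((p_hat(u) for u in sample), reverse=True)
    k = min(int(alpha * len(values)), len(values) - 1)
    return values[k]


def content_instability(x, y, z, h, alpha):
    """Fraction of Z points in exactly one of the two content-alpha level sets."""
    p_x, p_y = epanechnikov_kde(x, h), epanechnikov_kde(y, h)
    lam_x = level_for_content(p_x, x, alpha)
    lam_y = level_for_content(p_y, y, alpha)
    disagree = sum((p_x(u) > lam_x) != (p_y(u) > lam_y) for u in z)
    return disagree / len(z)


rng = random.Random(0)
data = sample_mixture(600, rng)
x, y, z = data[:200], data[200:400], data[400:]
for alpha in (0.50, 0.95):
    print(alpha, content_instability(x, y, z, h=2.0, alpha=alpha))
```

Note that each sample gets its own level, so two estimates can disagree on a point either because the densities differ there or because the estimated levels differ.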

In Figure 8, we examine $\Xi_n(h, \alpha)$ as a function of h for $\alpha = 0.50, 0.95$. For level sets that contain at
least 50% probability content, i.e. $\hat M_{h,X}(0.50)$, the instability quickly drops as h increases and then oscillates
as h approaches values that correspond to density estimates with uncertainty at those levels. Again, this
ambiguity occurs due to the presence of the second mode (we would see similar behavior with respect to
the smallest mode if $\alpha \approx 0.80$). As h continues to increase, the density estimates become smooth enough
that there is very little difference between $\hat M_{h,X}(0.50)$ and $\hat M_{h,Y}(0.50)$. This behavior also occurs when $\alpha = 0.95$,
albeit more quickly (Figure 8, top right), since level sets that contain at least 95% probability content occur
at lower heights and are more stable.

Figure 8c is the corresponding heat map for $\alpha = 0, 0.01, \ldots, 1.0$ and $h = 0, 0.01, \ldots, 10$. White sections
indicate $\Xi_n(h, \alpha) \approx 0$; black sections indicate higher instability values. In this particular example, the
maximum instability of 0.425 is found at $h = 0.03$, $\alpha = 0.46$. Note that around $h = 3$, we have very low
instability values for almost all values of $\alpha$, and hence this value of the kernel bandwidth would be a good choice
that yields stable clustering.

5.3 Instability Confidence Bands 

The results in the previous subsections were for splitting the original sample one time into three groups of
200 observations. Here we briefly include a snapshot of what the distribution of our instability measures
looks like over repeated splits. For computational reasons, we used the binned kernel density estimate, again
with the Epanechnikov kernel, and discretized the feature space over 200 bins; see Wand (1994). Increasing
Figure 7: Top: comparing $\hat M_{h,X}(0.50)$ and $\hat M_{h,Y}(0.50)$ for h = 2 (left) and h = 5 (right). Bottom: comparing
$\hat M_{h,X}(0.95)$ and $\hat M_{h,Y}(0.95)$ for h = 0.4 (left) and h = 3.5 (right). The data were sampled from the mixture
distribution of Figure 1. The two kernel density estimates are obtained using the X sample (solid line)
and the Y sample (dotted line). Points in the Z sample are shown as short vertical lines on the x-axis, and
are colored in red when they belong to $\hat M_{h,X}(\alpha) \Delta \hat M_{h,Y}(\alpha)$.



the number of bins improves the approximation to the kernel density estimate; the use of two hundred
bins was found to give almost identical results to the original kernel density estimate (results not shown).
We split the original sample 200 times and find 95% pointwise confidence intervals for $\Xi_n(h)$, $\Gamma_n(h)$, and
$\Xi_n(h, \alpha)$ for $\alpha = 0.50, 0.95$ as a function of h. The results are depicted in Figure 9. The confidence
bands are plotted in red, the medians in black. The distribution of the instability measures for each value
of h is also plotted using density strips (see Jackson, 2008); on the grey-scale, darker colors indicate more
common instability values. The density strips allow us to see how the distribution changes (not just the 50%
and 95% percentiles). For example, for the plot on the top left in Figure 9, note that right before h = 2, the upper
half of the distribution of $\Xi_n(h)$ is more concentrated. This shift corresponds to the increase in instability in
the presence of the additional modes.
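
The repeated-split procedure can be sketched as follows. This is an illustrative outline only: it uses the exact (unbinned) kernel density estimate rather than the binned one, a nearest-rank percentile, a small simulated sample and few splits to keep the run short; all names and settings are hypothetical.

```python
import math
import random


def sample_mixture(n, rng):
    """Draw n points from (4/7)N(0,1) + (2/7)N(3.5,1) + (1/7)N(7,1)."""
    draws = []
    for _ in range(n):
        r = rng.random()
        mean = 0.0 if r < 4 / 7 else (3.5 if r < 6 / 7 else 7.0)
        draws.append(rng.gauss(mean, 1.0))
    return draws


def epanechnikov_kde(data, h):
    n = len(data)

    def p_hat(u):
        total = 0.0
        for x in data:
            t = (u - x) / h
            if abs(t) < 1.0:
                total += 0.75 * (1.0 - t * t)
        return total / (n * h)

    return p_hat


def sample_instability(x, y, z, h, lam):
    p_x, p_y = epanechnikov_kde(x, h), epanechnikov_kde(y, h)
    return sum((p_x(u) > lam) != (p_y(u) > lam) for u in z) / len(z)


def percentile(sorted_values, q):
    """Nearest-rank percentile of an already sorted list, q in (0, 100]."""
    rank = max(1, math.ceil(q / 100.0 * len(sorted_values)))
    return sorted_values[rank - 1]


def instability_band(data, h, lam, n_splits, rng):
    """Return (2.5%, 50%, 97.5%) percentiles of the instability over random splits."""
    third = len(data) // 3
    values = []
    for _ in range(n_splits):
        shuffled = data[:]
        rng.shuffle(shuffled)
        x = shuffled[:third]
        y = shuffled[third:2 * third]
        z = shuffled[2 * third:3 * third]
        values.append(sample_instability(x, y, z, h, lam))
    values.sort()
    return percentile(values, 2.5), percentile(values, 50), percentile(values, 97.5)


rng = random.Random(0)
data = sample_mixture(150, rng)
for h in (0.5, 1.0, 2.0):
    print(h, instability_band(data, h, lam=0.09, n_splits=20, rng=rng))
```

Repeating this over a grid of bandwidths yields the pointwise bands; the bands are pointwise in h and make no simultaneous coverage claim.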
Figure 8: Top: $\Xi_n(h, \alpha = 0.50)$ (left) and $\Xi_n(h, \alpha = 0.95)$ (right) as a function of h. Bottom: heat map of
$\Xi_n(h, \alpha)$ as a function of h and $\alpha$. The data were sampled from the mixture distribution
of Figure 1.

5.4 Bivariate Moons 

We also include a bivariate example with two equal-sized moons; this data set with seemingly simple structure
can be quite difficult to analyze. The scatterplot of the data on the left in Figure 10 shows two clusters,
each shaped like a half moon. Each cluster contains 300 data points. The plot on the right in Figure 10
shows a two-dimensional kernel density estimate (for illustrative purposes) using a Gaussian kernel with
default bandwidth and evaluation points. We can see that while levels around $\lambda = 0.30$ show clear multi-modality,
the connectedness of the level sets around $\lambda = 0.15$ is less clear.

To examine instability we use a product kernel density estimate with an Epanechnikov kernel and the
same bandwidth h for both dimensions. Figure 11 shows the sample instability $\Xi_n(h)$ as a function of h for
$\lambda = 0.10, 0.20, 0.30$ as well as the total variation instability $\Gamma_n(h)$ as a function of h. As expected, the higher
the $\lambda$, the more quickly the sample instability drops. We also see the possible presence of multi-modality for
all three values of $\lambda$ in $\Xi_n(h)$. On the other hand, the total variation instability drops smoothly as h increases.
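
A bivariate version of the computation with a product Epanechnikov kernel might look like the sketch below. The moon generator is a hypothetical stand-in (two noisy half circles) and is not the data set used in the paper; the helper names, noise level and seed are likewise assumptions made for illustration.

```python
import math
import random


def make_moons(n_per_moon, noise, rng):
    """Hypothetical generator: two interleaved noisy half circles."""
    points = []
    for _ in range(n_per_moon):
        t = rng.uniform(0.0, math.pi)
        points.append((math.cos(t) + rng.gauss(0.0, noise),
                       math.sin(t) + rng.gauss(0.0, noise)))
        t = rng.uniform(0.0, math.pi)
        points.append((1.0 - math.cos(t) + rng.gauss(0.0, noise),
                       0.5 - math.sin(t) + rng.gauss(0.0, noise)))
    return points


def epan(t):
    """One-dimensional Epanechnikov kernel."""
    return 0.75 * (1.0 - t * t) if abs(t) < 1.0 else 0.0


def product_kde(data, h):
    """Product-kernel density estimate with the same bandwidth in both coordinates."""
    n = len(data)

    def p_hat(u):
        total = 0.0
        for x1, x2 in data:
            total += epan((u[0] - x1) / h) * epan((u[1] - x2) / h)
        return total / (n * h * h)

    return p_hat


def sample_instability(x, y, z, h, lam):
    """Fraction of Z points lying in exactly one of the two estimated level sets."""
    p_x, p_y = product_kde(x, h), product_kde(y, h)
    return sum((p_x(u) > lam) != (p_y(u) > lam) for u in z) / len(z)


rng = random.Random(0)
data = make_moons(150, noise=0.1, rng=rng)
rng.shuffle(data)
x, y, z = data[:100], data[100:200], data[200:]
for lam in (0.10, 0.20, 0.30):
    print(lam, sample_instability(x, y, z, h=0.3, lam=lam))
```

The product kernel is bounded by $0.75^2/h^2$, so for large h every estimate falls below a fixed level and the instability at that level vanishes.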

Figure 12 contains the instability as a function of h and probability content $\alpha$ for all values of h, $\alpha$ (Figure
12d) and specifically for $\alpha = 0.50, 0.75, 0.95$. Again, as expected, $\Xi_n(h, \alpha)$ drops as h increases for smaller
values of $\alpha$. Note that for $\alpha = 0.95$, the instability remains relatively low regardless of the value of h. When
Figure 9: 95% pointwise confidence bands for $\Xi_n(h)$ (top left), $\Gamma_n(h)$ (top right), $\Xi_n(h, \alpha = 0.50)$ (bottom
left) and $\Xi_n(h, \alpha = 0.95)$ (bottom right) for data sampled from the mixture distribution of Figure 1.



examining the heat map, we see that for small values of h, level sets corresponding to probability content 
around 0.4-0.6 are very unstable. This behavior is not unexpected given that the moons are of equal sizes 
and difficult to separate due to the noise. We would expect to have difficulty finding stable level sets "in the 
middle". 



6 Discussion 

We have investigated the properties of the density level set and density tree estimators based on kernel
density estimates, and we have proposed and analyzed various measures of instability for these quantities.
We believe these measures of instability can provide useful guidelines for choosing the bandwidth parameter
and can also serve as exploratory tools to gain insights into the properties and shape of the data-generating distribution.

Our analysis leaves some open questions that we think deserve further attention. First, we have
focused on kernel density estimators but the same ideas can be used with other density estimators or,
more generally, with other clustering methods for which underlying tuning parameters have to be chosen in a
data-driven fashion. See, for instance, Meinshausen and Bühlmann (2010) for a related stability-based approach
to clustering.
Figure 10: Bivariate moons (left) and contours of a Gaussian kernel density estimate (right) for the example
discussed in Section 5.4.

We have assumed the existence of the Lebesgue density p but this assumption can be relaxed using
methods in Rinaldo and Wasserman (2010) to allow for distributions supported on lower-dimensional,
well-behaved subsets. This extension is potentially important because it would allow us to include cases where
the distribution has positive mass on lower dimensional structures such as points and manifolds.

Finally, in computing the various measures of instability, we have considered just a single split of the
data into non-overlapping sub-samples. In fact, one can randomly repeat the splitting process and combine
over many splits, which is how we obtained the confidence bands of Figure 9. Though the increase in
computational cost may be significant, repeated sub-sampling would yield a reliable estimate of the
uncertainty of the chosen instability measures and would therefore be highly informative about the sample.
We believe that the properties of $\Xi_n$ can be established using the theory of U-statistics.

7 Proofs 

Proof of Theorem 3.1: Let $\mathcal{A}_{h_n,\epsilon_n}$ denote the event that $\|\hat p_{h_n,X} - p_{h_n}\|_\infty \le \epsilon_n$. Then, for all $n \ge n_0$, by
equation (10), $P_X(\mathcal{A}_{h_n,\epsilon_n}) \ge 1 - \frac{1}{n}$. Also observe that Assumption (A1) implies that, for any $h > 0$, the
sup-norm density approximation error can be bounded as

$$\|p_h - p\|_\infty = \sup_u \left| \int K(\|z\|)\, p(u + zh)\, dz - p(u) \right| \le A h \int \|z\| K(\|z\|)\, dz = ADh. \quad (34)$$

The second step in the previous display follows since $\int K(\|z\|)\, dz = 1$ and using the Lipschitz assumption (A1)
on the density, and the last step since $|\int \|z\| K(\|z\|)\, dz| < \infty$. Putting the estimation and approximation error
together, and using the triangle inequality, we obtain that, on the event $\mathcal{A}_{h_n,\epsilon_n}$,

$$\|\hat p_{h_n,X} - p\|_\infty \le ADh_n + \epsilon_n \quad (35)$$
Figure 11: $\Xi_n(h)$ as a function of h for $\lambda = 0.10$ (top left), 0.20 (top right) and 0.30 (bottom left). $\Gamma_n(h)$ as a
function of h (bottom right) for the data depicted in Figure 10.

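for all $n \ge n_0$. Using equation (35), we conclude that, on $\mathcal{A}_{h_n,\epsilon_n}$ and for all $n \ge n_1(n_0, \lambda)$ so that $ADh_n + \epsilon_n < \lambda$,

$$L(\lambda) \Delta \hat L_{h_n,X}(\lambda) = \{u : p(u) > \lambda, \hat p_{h_n,X}(u) \le \lambda\} \cup \{u : p(u) \le \lambda, \hat p_{h_n,X}(u) > \lambda\}$$

$$\subseteq \{u : p(u) > \lambda, p(u) \le \lambda + ADh_n + \epsilon_n\} \cup \{u : p(u) \le \lambda, p(u) > \lambda - ADh_n - \epsilon_n\}$$

$$= \{u : |p(u) - \lambda| \le ADh_n + \epsilon_n\}.$$

Then, on $\mathcal{A}_{h_n,\epsilon_n}$ and for all $n \ge n_1(n_0, \lambda)$ large enough,

$$\mathcal{L}(h_n, X, \lambda) = P(L(\lambda) \Delta \hat L_{h_n,X}(\lambda)) \le r_{h_n,\epsilon_n,\lambda},$$

so that $P_X(\mathcal{L}(h_n, X, \lambda) \le r_{h_n,\epsilon_n,\lambda}) \ge P_X(\mathcal{A}_{h_n,\epsilon_n}) \ge 1 - \frac{1}{n}$, as claimed.

If (A2) is in force for the density level $\lambda$, then for all $n \ge n_2(n_0, \lambda, A, D, \epsilon_0)$ so that $ADh_n + \epsilon_n < \epsilon_0$, we
have $r_{h_n,\epsilon_n,\lambda} \le \kappa_2(ADh_n + \epsilon_n)$, which proves the second claim.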

Proof of Lemma 3.3: Using (A1) and the fact that $\int_{\mathbb{R}^d} K(\|z\|)\, dz = 1$, Eq. (34) states that for any $h > 0$,

$$\|p_h - p\|_\infty \le ADh.$$
Figure 12: $\Xi_n(h, \alpha)$ as a function of h for $\alpha = 0.50$ (top left), 0.75 (top right) and 0.95 (bottom left). Heat
map of $\Xi_n(h, \alpha)$ as a function of h and $\alpha$ (bottom right).

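Then, for any $\alpha \in (0, 1)$ and $h > 0$,

$$\{u : p(u) > \lambda_{h,\alpha} + ADh\} \subseteq \{u : p_h(u) > \lambda_{h,\alpha}\} \subseteq \{u : p(u) > \lambda_{h,\alpha} - ADh\}.$$

And as a result,

$$P(\{u : p(u) > \lambda_{h,\alpha} + ADh\}) \le P(\{u : p_h(u) > \lambda_{h,\alpha}\}) \le P(\{u : p(u) > \lambda_{h,\alpha} - ADh\}).$$

Since $P(\{u : p(u) > \lambda_\alpha\}) = \alpha = P(\{u : p_h(u) > \lambda_{h,\alpha}\})$, we have

$$P(\{u : p(u) > \lambda_{h,\alpha} + ADh\}) \le P(\{u : p(u) > \lambda_\alpha\}) \le P(\{u : p(u) > \lambda_{h,\alpha} - ADh\}).$$

Consequently,

$$\lambda_{h,\alpha} + ADh \ge \lambda_\alpha \ge \lambda_{h,\alpha} - ADh.$$

It follows that for any $\alpha \in (0, 1)$ and $h > 0$,

$$|\lambda_{h,\alpha} - \lambda_\alpha| \le ADh.$$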
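Proof of Lemma 3.4: Let $\mathcal{C}_h = \{\{u : p_h(u) > \lambda\}, \lambda > 0\}$ denote the class of level sets of $p_h$ and define the
events

$$\mathcal{P}_{h,\epsilon} = \left\{ \sup_{C \in \mathcal{C}_h} |\hat P_X(C) - P(C)| \le \epsilon \right\} \quad \text{and} \quad \mathcal{A}_{h,\epsilon} = \left\{ \|\hat p_{h,X} - p_h\|_\infty \le \epsilon \right\}.$$

Then, since the n-th shatter coefficient of $\mathcal{C}_h$ is n,

$$P_X(\mathcal{P}^c_{h,\epsilon}) \le 8n e^{-n\epsilon^2/32} \quad \text{and} \quad P_X(\mathcal{A}^c_{h,\epsilon}) \le K_1 e^{-K_2 n h^d \epsilon^2}, \quad (36)$$

where the first inequality follows from the VC inequality and the second inequality is just (8). Then, on $\mathcal{A}_{h,\epsilon}$,
we obtain

$$\{u : p_h(u) > \lambda + \epsilon\} \subseteq \{u : \hat p_{h,X}(u) > \lambda\} \subseteq \{u : p_h(u) > \lambda - \epsilon\}, \quad \forall \lambda > 0.$$

Thus, on $\mathcal{A}_{h,\epsilon}$,

$$\hat P_X(\{u : p_h(u) > \lambda + \epsilon\}) \le \hat P_X(\{u : \hat p_{h,X}(u) > \lambda\}) \le \hat P_X(\{u : p_h(u) > \lambda - \epsilon\}),$$

uniformly over all $\lambda > 0$ and any $h > 0$. In particular, the previous inequalities also hold for $\hat\lambda_{h,\alpha,X}$ (which is
positive with probability one) for any $\alpha \in (0, 1)$ and $h > 0$.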
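Recalling that, by definition,

$$\left| \hat P_X(\{u : \hat p_{h,X}(u) > \hat\lambda_{h,\alpha,X}\}) - \alpha \right| \le 1/n,$$

we obtain, on the events $\mathcal{P}_{h,\epsilon}$ and $\mathcal{A}_{h,\epsilon}$,

$$P(\{u : p_h(u) > \hat\lambda_{h,\alpha,X} + \epsilon\}) - \frac{1}{n} - \epsilon \le \alpha \le P(\{u : p_h(u) > \hat\lambda_{h,\alpha,X} - \epsilon\}) + \frac{1}{n} + \epsilon. \quad (37)$$

Since $\alpha = P(\{u : p_h(u) > \lambda_{h,\alpha}\})$, the first inequality in (37) can be written as

$$\alpha + \frac{1}{n} + \epsilon = P(\{u : p_h(u) > \lambda_{h,\alpha + \frac{1}{n} + \epsilon}\}) \ge P(\{u : p_h(u) > \hat\lambda_{h,\alpha,X} + \epsilon\})$$

and the second one as

$$\alpha - \frac{1}{n} - \epsilon = P(\{u : p_h(u) > \lambda_{h,\alpha - \frac{1}{n} - \epsilon}\}) \le P(\{u : p_h(u) > \hat\lambda_{h,\alpha,X} - \epsilon\}),$$

both holding on the events $\mathcal{P}_{h,\epsilon}$ and $\mathcal{A}_{h,\epsilon}$. Combining the last two expressions, we obtain, on the same
events, for any $\alpha \in (0, 1)$ and $h > 0$,

$$\lambda_{h,\alpha + \frac{1}{n} + \epsilon} - \epsilon \le \hat\lambda_{h,\alpha,X} \le \lambda_{h,\alpha - \frac{1}{n} - \epsilon} + \epsilon. \quad (38)$$

We will now show that for level sets of $p_h$ indexed by $\alpha$ that satisfy (B3), for any $\eta \in (-\eta_0, \eta_0)$ and $0 < h < H$,
we have

$$|\lambda_{h,\alpha+\eta} - \lambda_{h,\alpha}| \le A\kappa_3 |\eta|. \quad (39)$$

In fact, (38) and (39) will imply that, on the events $\mathcal{P}_{h,\epsilon}$ and $\mathcal{A}_{h,\epsilon}$, for level sets of $p_h$ indexed by $\alpha$ that satisfy
(B3), since $\epsilon + 1/n < \eta_0$ and $0 < h < H$, we have

$$\lambda_{h,\alpha} - A\kappa_3\left( \epsilon + \frac{1}{n} \right) - \epsilon \le \hat\lambda_{h,\alpha,X} \le \lambda_{h,\alpha} + A\kappa_3\left( \epsilon + \frac{1}{n} \right) + \epsilon,$$

from which, using (36), the claim will follow.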

In order to show (39), for a set $A \subset \mathbb{R}^d$, let $\partial A$ denote its boundary. Then, notice that, because $p_h$ is
Lipschitz and hence continuous, for every $x \in \partial M_h(\alpha)$, $p_h(x) = \lambda_{h,\alpha}$ and, for every $y \in \partial M_h(\alpha + \eta)$, $p_h(y) = \lambda_{h,\alpha+\eta}$.
Furthermore, for any point $x \in \partial M_h(\alpha)$, there exists a point $y = y(x) = \arg\min_{z \in \partial M_h(\alpha+\eta)} \|x - z\|$.
Thus, for $|\eta| < \eta_0$,

$$\|x - y\| \le d_\infty(M_h(\alpha), M_h(\alpha + \eta)) \le \kappa_3 |\eta|,$$

where the last inequality follows for level sets of $p_h$ indexed by $\alpha$ that satisfy (B3) and $0 < h < H$. Therefore,

$$|\lambda_{h,\alpha+\eta} - \lambda_{h,\alpha}| = |p_h(y) - p_h(x)| \le A\|x - y\| \le A\kappa_3 |\eta|,$$

where in the first inequality we used the fact that, by (A1), $p_h$ is Lipschitz with constant $A$. Indeed, for any
$x \ne y$, using the Lipschitz assumption (A1) on $p$,

$$|p_h(x) - p_h(y)| \le \int |p(x + zh) - p(y + zh)|\, K(z)\, dz \le A\|x - y\| \int K(z)\, dz = A\|x - y\|.$$



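Proof of Theorem 3.6: Let $\mathcal{A}_{h_n,\epsilon_n}$ be the event defined in the proof of Theorem 3.1, and recall that for all
$n \ge n_0$, by equation (10), $P_X(\mathcal{A}^c_{h_n,\epsilon_n}) \le 1/n$ and that equation (35) states that

$$\|\hat p_{h,X} - p\|_\infty \le C_{1,n} \quad (40)$$

on that event, for all $n \ge n_0$. Also, let $\mathcal{P}_{h_n,\epsilon_n}$ be the event defined in Lemma 3.4 such that $P_X(\mathcal{P}^c_{h_n,\epsilon_n}) \le 8n e^{-n\epsilon_n^2/32}$.
Then from the proof of Lemma 3.4, we have that on the event $\mathcal{A}_{h_n,\epsilon_n} \cap \mathcal{P}_{h_n,\epsilon_n}$, for $h_n = \omega((\log n/n)^{1/d})$
and $h_n < H$,

$$|\hat\lambda_{h_n,\alpha,X} - \lambda_\alpha| \le C_{2,n} \quad (41)$$

for all $n \ge n_3(n_0, \eta_0, \kappa_3)$. Also, since n is large enough, we have

$$8n e^{-n\epsilon_n^2/32} \le \frac{1}{n}.$$

Therefore, for all such large n, both (40) and (41) hold with probability at least $P_X(\mathcal{A}_{h_n,\epsilon_n} \cap \mathcal{P}_{h_n,\epsilon_n}) \ge 1 - \frac{2}{n}$.
Thus, on $\mathcal{A}_{h_n,\epsilon_n} \cap \mathcal{P}_{h_n,\epsilon_n}$, for $h_n = \omega((\log n/n)^{1/d})$ and $h_n < H$, we have for all $n \ge n_3(n_0, \eta_0, \kappa_3)$

$$M(\alpha) \Delta \hat M_{h,X}(\alpha) = \{u : p(u) > \lambda_\alpha, \hat p_{h,X}(u) \le \hat\lambda_{h,\alpha,X}\} \cup \{u : p(u) \le \lambda_\alpha, \hat p_{h,X}(u) > \hat\lambda_{h,\alpha,X}\}$$

$$\subseteq \{u : p(u) > \lambda_\alpha, p(u) \le \hat\lambda_{h,\alpha,X} + C_{1,n}\} \cup \{u : p(u) \le \lambda_\alpha, p(u) > \hat\lambda_{h,\alpha,X} - C_{1,n}\}$$

$$\subseteq \{u : p(u) > \lambda_\alpha, p(u) \le \lambda_\alpha + C_{1,n} + C_{2,n}\} \cup \{u : p(u) \le \lambda_\alpha, p(u) > \lambda_\alpha - C_{1,n} - C_{2,n}\}$$

$$= \{u : |p(u) - \lambda_\alpha| \le C_{1,n} + C_{2,n}\}.$$

Therefore, for $h_n = \omega((\log n/n)^{1/d})$ and $h_n < H$, we have for all $n \ge n_3(n_0, \eta_0, \kappa_3)$,

$$P_X\left( \mathcal{L}(h_n, X, \alpha) \le r_{h_n,\epsilon_n,\alpha} \right) \ge P_X(\mathcal{A}_{h_n,\epsilon_n} \cap \mathcal{P}_{h_n,\epsilon_n}) \ge 1 - \frac{2}{n}.$$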



Proof of Theorem 4.2:

1. Since X, Y and Z are independent samples from the same distribution, $\hat p_{h,X}(u)$ and $\hat p_{h,Y}(u)$ are
independent and identically distributed, for any $u \in \mathbb{R}^d$ and $h > 0$. Also, notice that for every measurable
set $A$, $E_Z(\hat P_Z(A)) = P(A)$. Thus,

$$\xi(h) = E_{X,Y,Z}\left[ \hat P_Z(\{u : \hat p_{h,X}(u) > \lambda\} \Delta \{u : \hat p_{h,Y}(u) > \lambda\}) \right]$$

$$= E_{X,Y}\left[ P(\{u : \hat p_{h,X}(u) > \lambda, \hat p_{h,Y}(u) \le \lambda\}) + P(\{u : \hat p_{h,X}(u) \le \lambda, \hat p_{h,Y}(u) > \lambda\}) \right]$$

$$= 2 E_{X,Y}\left[ P(\{u : \hat p_{h,X}(u) > \lambda, \hat p_{h,Y}(u) \le \lambda\}) \right]$$

$$= 2 \int P_{X,Y}\left( \hat p_{h,X}(u) > \lambda, \hat p_{h,Y}(u) \le \lambda \right) dP(u), \quad (42)$$

where the last identity follows from Fubini's theorem. The integrand in the last equation can be written
as

$$P_{X,Y}\left( \hat p_{h,X}(u) > \lambda, \hat p_{h,Y}(u) \le \lambda \right) = P_X(\hat p_{h,X}(u) > \lambda)\, P_Y(\hat p_{h,Y}(u) \le \lambda)$$

$$= P_X(\hat p_{h,X}(u) > \lambda)\, P_X(\hat p_{h,X}(u) \le \lambda)$$

$$= \pi_h(u)(1 - \pi_h(u)),$$

from which (22) follows.
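2. Let $\mathcal{A}_{h,\epsilon}$ denote the event

$$\|p_h - \hat p_{h,X}\|_\infty \vee \|p_h - \hat p_{h,Y}\|_\infty \le \epsilon. \quad (43)$$

By (8), $P_{X,Y}(\mathcal{A}^c_{h,\epsilon}) \le 2K_1 e^{-K_2 n h^d \epsilon^2}$. Letting $1_{\mathcal{A}_{h,\epsilon}}$ denote the indicator function of the event $\mathcal{A}_{h,\epsilon}$,

$$\xi(h) \le E_{X,Y,Z}\left[ \hat P_Z(\{u : \hat p_{h,X}(u) > \lambda\} \Delta \{u : \hat p_{h,Y}(u) > \lambda\})\, 1_{\mathcal{A}_{h,\epsilon}}(X, Y) \right] + P_{X,Y}(\mathcal{A}^c_{h,\epsilon}),$$

and, using the same reasoning that led to (42),

$$\xi(h) \le 2 \int P_{X,Y}\left( \{\hat p_{h,X}(u) > \lambda, \hat p_{h,Y}(u) \le \lambda\} \cap \mathcal{A}_{h,\epsilon} \right) dP(u) + P_{X,Y}(\mathcal{A}^c_{h,\epsilon}).$$

Notice that, on $\mathcal{A}_{h,\epsilon}$,

$$\{u : \hat p_{h,X}(u) > \lambda, \hat p_{h,Y}(u) \le \lambda\} \subseteq \{u : \lambda - \epsilon < p_h(u) < \lambda + \epsilon\} = U_{h,\epsilon},$$

and therefore, $\mathrm{sign}(\hat p_{h,X}(u) - \lambda) = \mathrm{sign}(p_h(u) - \lambda)$ for all $u \notin U_{h,\epsilon}$. Thus, the previous expression for
$\xi(h)$ is upper bounded by

$$2 \int_{U_{h,\epsilon}} P_{X,Y}\left( \{\hat p_{h,X}(u) > \lambda, \hat p_{h,Y}(u) \le \lambda\} \cap \mathcal{A}_{h,\epsilon} \right) dP(u) + 2K_1 e^{-K_2 n h^d \epsilon^2}$$

which, in turn, using independence, is no larger than

$$2 \int_{U_{h,\epsilon}} \pi_h(u)(1 - \pi_h(u))\, dP(u) + 2K_1 e^{-K_2 n h^d \epsilon^2} \le P(U_{h,\epsilon})\, \overline{A}_{h,\epsilon} + 2K_1 e^{-K_2 n h^d \epsilon^2}.$$

As for the lower bound, from (42) we obtain, trivially,

$$\xi(h) \ge 2 \int_{U_{h,\epsilon}} \pi_h(u)(1 - \pi_h(u))\, dP(u) \ge P(U_{h,\epsilon})\, \underline{A}_{h,\epsilon}.$$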



Proof of Lemma 4.3: For simplicity, we will provide the proof for the case of a spherical kernel, i.e. $K(x) = 1_{\{\|x\| \le 1\}}$, $x \in \mathbb{R}^d$.
The extension to other compactly supported kernels is analogous.

By the minimal spacings theorem (see Deheuvels et al., 1988), for all n large enough, there exists a
constant $C > 0$ such that, P-almost surely, the quantities

$$\min_{i \ne j} \|X_i - X_j\|, \quad \min_{i \ne j} \|Y_i - Y_j\| \quad \text{and} \quad \min_{i,j} \|X_i - Y_j\|$$

are all larger than $C \left( \frac{\log n}{n} \right)^{1/d}$. Hence, by the compactness of the support of K, if $h < C(\log n/n)^{1/d}/2$, the
sets $B(X_1, h), \ldots, B(X_n, h), B(Y_1, h), \ldots, B(Y_n, h)$ are disjoint. Therefore, $\hat p_{h,X}(u) = 1/(nh^d)$ if and only if
$u \in B(X_i, h)$ for one $i$ and, similarly, $\hat p_{h,Y}(u) = 1/(nh^d)$ if and only if $u \in B(Y_j, h)$ for one $j$. Furthermore,

$$\hat L_{h,X} \Delta \hat L_{h,Y} = \left( \bigcup_i B(X_i, h) \right) \cup \left( \bigcup_i B(Y_i, h) \right).$$

As a result, $\Xi_n(h)$ is the fraction of $Z_i$'s contained in $\left( \bigcup_i B(X_i, h) \right) \cup \left( \bigcup_i B(Y_i, h) \right)$. Thus,

$$\Xi_n(h) = \hat P_Z(\hat L_{h,X} \Delta \hat L_{h,Y} \mid X, Y) \stackrel{d}{=} B/n,$$

where $\stackrel{d}{=}$ denotes equality in distribution and $B \sim \mathrm{Binomial}(n, p_0)$, with $p_0 \le 2n p_{\max} v_d h^d$ and $p_{\max} = \|p\|_\infty$.
Therefore, $E_Z[\Xi_n(h) \mid X, Y] \le 2 p_{\max} v_d n h^d$ and hence it follows that

$$\xi(h) = E_{X,Y,Z}[\Xi_n(h)] \le 2 p_{\max} v_d n h^d = O(h^d),$$

as $h \to 0$.
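
The small-bandwidth regime in this proof can be checked on a toy example. The sketch below assumes a one-dimensional spherical (boxcar) kernel normalized to integrate to one, so that a single isolated ball has height $1/(2nh)$, and uses hand-picked points whose balls are disjoint; the function names and the data are hypothetical.

```python
def ball_kde(data, h):
    """1D kernel density estimate with the spherical (boxcar) kernel; v_1 = 2."""
    n = len(data)

    def p_hat(u):
        return sum(1 for x in data if abs(u - x) <= h) / (2.0 * n * h)

    return p_hat


def sample_instability(x, y, z, h, lam):
    """Fraction of Z points lying in exactly one of the two estimated level sets."""
    p_x, p_y = ball_kde(x, h), ball_kde(y, h)
    return sum((p_x(u) > lam) != (p_y(u) > lam) for u in z) / len(z)


def fraction_near_samples(x, y, z, h):
    """Fraction of Z points within distance h of some X_i or Y_j."""
    return sum(any(abs(u - s) <= h for s in x + y) for u in z) / len(z)


# Hand-picked points: all balls of radius h around X and Y points are disjoint.
x = [0.0, 1.0, 2.0]
y = [0.5, 1.5, 2.5]
z = [0.05, 0.3, 1.45, 2.0, 5.0, 2.58]
h = 0.1
lam = 0.5 / (2.0 * len(x) * h)  # half the height of a single isolated ball
print(sample_instability(x, y, z, h, lam), fraction_near_samples(x, y, z, h))
```

When the balls are disjoint, the two level sets never overlap, so the symmetric difference is the union of all balls and the two printed numbers agree.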



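Proof of Lemma 4.4. If K is the spherical kernel, note that $\hat p_{h,X}(u) = n^{-1} \sum_{i=1}^n B_i(u)$, where

$$B_i(u) = h^{-d} K\left( \frac{\|u - X_i\|}{h} \right) = \frac{1_{B(u,h)}(X_i)}{h^d v_d},$$

with $1_{B(u,h)}(\cdot)$ denoting the indicator function of the ball $B(u, h)$. Let $\sigma^2(u, h) = \mathrm{Var}(B_i(u))$ and $\mu_3(u, h) = E|B_i(u) - \mu(u, h)|^3$,
where $\mu(u, h) = E(B_i(u)) = p_h(u)$. Finally, let $p_{u,h} = P(B(u, h))$. Then,

$$\sigma^2(u, h) = \frac{p_{u,h}(1 - p_{u,h})}{(h^d v_d)^2} \quad (44)$$

and

$$\mu_3(u, h) = \frac{p_{u,h}(1 - p_{u,h})\left[ (1 - p_{u,h})^2 + p_{u,h}^2 \right]}{(h^d v_d)^3} \le \frac{p_{u,h}(1 - p_{u,h})}{(h^d v_d)^3},$$

where the last inequality holds since $(1 - p_{u,h})^2 + p_{u,h}^2 \le 1$, for all $u$ and $h$. As a result,

$$\frac{\mu_3(u, h)}{\sigma^3(u, h)} \le \left( p_{u,h}(1 - p_{u,h}) \right)^{-1/2}. \quad (45)$$

By assumption, $h \le h(\delta, \epsilon)$ and $\epsilon \le \lambda/2$. In order to avoid trivialities, we further assume that $P(U_{h,\epsilon}) > 0$.
Then, uniformly over all $u$ in $U_{h,\epsilon}$,

$$(\lambda - \epsilon) v_d h^d \le p_{u,h} \le (\lambda + \epsilon) v_d h^d$$

and

$$(1 - p_{u,h}) \ge \delta.$$

Thus,

$$\frac{\mu_3(u, h)}{\sigma^3(u, h)} \le \sqrt{\frac{1}{\delta v_d h^d (\lambda - \epsilon)}} \le \sqrt{\frac{2}{h^d \delta v_d \lambda}},$$

with the last inequality holding because of our assumption $\epsilon \le \lambda/2$. From (44), we then obtain

$$\frac{\delta(\lambda - \epsilon)}{v_d h^d} \le \sigma^2(u, h) \le \frac{\lambda + \epsilon}{v_d h^d}.$$

Thus,

$$\frac{a_1}{h^d} \le \sigma^2(u, h) \le \frac{a_2}{h^d}, \quad (46)$$

where

$$a_1 = \frac{\delta\lambda}{2 v_d} \quad \text{and} \quad a_2 = \frac{3\lambda}{2 v_d},$$

uniformly over $u \in U_{h,\epsilon}$.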

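Writing $\sigma^2(u, h) = a(u, h)/h^d$ and using the Berry-Esseen bound (Wasserman (2004) p 78), we obtain

$$\sup_t \left| P\left( \frac{\sqrt{n h^d}\,(\hat p_{h,X}(u) - p_h(u))}{a(u, h)} \le t \right) - \Phi(t) \right| \le \frac{33}{4} \frac{\mu_3(u, h)}{\sigma^3(u, h)\sqrt{n}} \le \frac{C(\delta, \lambda)}{\sqrt{n h^d}},$$

where $\Phi$ is the cumulative distribution function of the standard Normal distribution.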
Now, 



ir h (u) = V x {pkx{u) > A) 



x 



/ nh?(Ph,x(u) ~ Vh{u)) Vnh 3 ^- p h {u)) 



Hence, 



1 - $ 



'VnhA(X-p h (u))\ C(6,X) 



a(u, h) 



ih d 



a(u, h) 



< ir h (u) < 1 - $ 



a(u, h) J 

nf?{X-p h {u))\ | C(S,X) 
I \fnh d 



a{u, h) 



Using the fact that u € Uh.e, and taking advantage of the uniform bounds a,\ < a(u, h) < a 2 , the previous 
inequalities imply 



1 - $ 



^e ] _ C(o, A) 
VrJ? 



Noting that 



and 



we obtain the bounds 



a i 



1 - $ 



< WhM < 1 - $ 



"nh^e ^ + CjS, A) 



a 2 



ih d t 



«i 



1 - $ -- 



4> 



= $ 



«i 



ih d ( 



> $ 



< $ 



ih d e 



a 2 



ih d t 



(In 



a 2 



$ 



tfe d e ^ _ C(o,A) 
Vnh^ 



«2 



<7T h ( U ) <!-$ -- 



ai 



C 



a 2 



ih d 



and 



1 - $ I I < tt^u) < $ I 



C 



CLi 



lh d ' 



'nh d \ o,i 

respectively. Thus, uniformly over all e < A/2 and all h < h(S, e), equation (47) and (48) yield 

2 



and 



Afc, e = 2 sup n h (u)(l -7T h {u)) < 2 1-$ 
ueu h , t \ 



A he = 2 inf 70,(«)(1 - 7T h («)) > 2 1-$ 

uGUh,t \ 



ih d t 



0. 2 



g^A) 



ih d e \ _ C(6, A) 
Vnh^ 



(47) 



(48) 



respectively, where ax and a 2 are given in (46). 
Proof of Lemma 4.5. Letting $1_i = 1_{\{Z_i \in \hat L_{h,X} \Delta \hat L_{h,Y}\}}$, we have

$$\Xi_n(h) = \frac{1}{n} \sum_{i=1}^n 1_i,$$

where, conditionally on X and Y, the $1_i$'s are independent and identically distributed Bernoulli random
variables with $E_Z[1_i \mid X, Y] = P(\hat L_{h,X} \Delta \hat L_{h,Y})$. Thus

Y[E n (h)} = E x , Y , z [E 2 n (h)]-e(h) 

= ^ExyIEz \(ZLi^ + E^khik)\x,Y] \ -e(h) 



< 



n 
n 

em 



n-1 , 



P 2 (L hX H.L h: \ 



2n ^X,Y 

2n ah)-e(h) 

2n m- 



P{L KX AL KY ) -e(h) 



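Proof of Lemma 4.6.

Let $\xi(h, X, Y) = E_Z[\Xi_n(h) \mid X, Y]$ and let $\mathcal{A}_{h,\epsilon}$ be the event given in (43), where $\epsilon, h > 0$, so that
$P_{X,Y}(\mathcal{A}^c_{h,\epsilon}) \le 2K_1 \exp\{-nK_2 h^d \epsilon^2\}$ by (8). Then, we can write

$$P_{X,Y,Z}\left( |\Xi_n(h) - \xi(h)| > t \right) = P_{X,Y,Z}\left( |\Xi_n(h) - \xi(h, X, Y) + \xi(h, X, Y) - \xi(h)| > t \right),$$

which is therefore upper bounded by

$$P_{X,Y,Z}\left( |\Xi_n(h) - \xi(h, X, Y) + \xi(h, X, Y) - \xi(h)| > t;\ \mathcal{A}_{h,\epsilon} \right) + 2K_1 \exp\{-nK_2 h^d \epsilon^2\}.$$

The first term in the previous expression is no larger than

$$E_{X,Y}\left[ P_Z\left( |\Xi_n(h) - \xi(h, X, Y)| > t\eta \mid X, Y \right);\ \mathcal{A}_{h,\epsilon} \right] + P_{X,Y}\left( |\xi(h, X, Y) - \xi(h)| > t(1 - \eta);\ \mathcal{A}_{h,\epsilon} \right)$$

for any $\eta \in (0, 1)$. We will first show that, if (26) is satisfied,

$$P_{X,Y}\left( |\xi(h, X, Y) - \xi(h)| > t(1 - \eta);\ \mathcal{A}_{h,\epsilon} \right) = 0.$$

Indeed, first observe that

$$E_Z[\Xi_n(h) \mid X, Y] = P(\hat L_{h,X} \Delta \hat L_{h,Y})$$

and that, on $\mathcal{A}_{h,\epsilon}$,

$$\hat L_{h,X} \Delta \hat L_{h,Y} = \{u : \hat p_{h,X}(u) > \lambda, \hat p_{h,Y}(u) \le \lambda\} \cup \{u : \hat p_{h,X}(u) \le \lambda, \hat p_{h,Y}(u) > \lambda\}$$

$$\subseteq \{u : p_h(u) > \lambda - \epsilon, p_h(u) < \lambda + \epsilon\}$$

$$= \{u : |p_h(u) - \lambda| < \epsilon\}$$

$$= U_{h,\epsilon}.$$

Therefore, on $\mathcal{A}_{h,\epsilon}$,

$$\xi(h, X, Y) = E_Z[\Xi_n(h) \mid X, Y] \le r_{h,\epsilon} \le t(1 - \eta). \quad (49)$$

By part 2 of Theorem 4.2, (26) further implies that $t(1 - \eta) \ge \xi(h)$. As a result, on $\mathcal{A}_{h,\epsilon}$, $|\xi(h, X, Y) - \xi(h)| \le t(1 - \eta)$,
which yields

$$P_{X,Y}\left( |\xi(h, X, Y) - \xi(h)| > t(1 - \eta);\ \mathcal{A}_{h,\epsilon} \right) = 0,$$

as claimed.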

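We now proceed to bound from above

$$E_{X,Y}\left[ P_Z\left( |\Xi_n(h) - \xi(h, X, Y)| > t\eta \mid X, Y \right);\ \mathcal{A}_{h,\epsilon} \right]. \quad (50)$$

Since

$$\Xi_n(h) = \frac{1}{n} \sum_{i=1}^n 1_{\{Z_i \in \hat L_{h,X} \Delta \hat L_{h,Y}\}},$$

Bernstein's inequality (see, for instance, Massart, 2006, Proposition 2.9) yields that, for any $t > 0$ and
conditionally on X and Y,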

¥ Z (\E n (h) - £{h, X.Y)\> tf}\x, r) < exp {-9a 2 (X, Y, h)g ( g^^fe) ) } (51) 
where <?(u) = 1 + u — </! + 2u for all u > 0, and 

<T 2 (x ! y,/ l ) = Vaj- 2 [H n (/ l )|x,y]. 

It is easy to see that 

a 2 (X,Y,h) <E z [E n (h)\X,Y]=n£(h,X,Y) 

and, therefore, restricting to the event Ah.e, & 2 (X, Y, h) < nt(l — rj), just like in (49). 

Using the fact that e -9a:9 (^) is increasing in x for x > 0, we conclude that, on the event Ah, e > the right 
hand side of (51) is bounded from above by 

exp < — 9ni(l — rj)g 



3(1 - r,) 

which is independent of X and Y. Thus, the previous expression is an upper bound for (50) and, therefore, 
for Vx,y,z (|S n (/i) — £,(h)\ > t). The claim now follows from simple algebra. 
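The monotonicity step can be illustrated numerically. The sketch below assumes that the conditional Bernstein bound has the form $\exp\{-9\sigma^2 g(nt\eta/(3\sigma^2))\}$, with $g(u) = 1 + u - \sqrt{1 + 2u}$ and $\sigma^2$ the conditional variance of the sum of the $n$ indicators; this form is an assumption made for the example, and the values of `n`, `t` and `eta` are hypothetical. It evaluates the bound on a grid of admissible $\sigma^2 \le nt(1-\eta)$ and compares it with the value at the largest admissible $\sigma^2$.

```python
import math

def g(u):
    """g(u) = 1 + u - sqrt(1 + 2u), the function appearing in the Bernstein bound."""
    return 1.0 + u - math.sqrt(1.0 + 2.0 * u)

def conditional_bound(sigma2, n, t, eta):
    """Assumed form of the conditional bound: exp{-9 s g(n t eta / (3 s))}, s = sigma2."""
    return math.exp(-9.0 * sigma2 * g(n * t * eta / (3.0 * sigma2)))

def uniform_bound(n, t, eta):
    """Value obtained by plugging in the largest admissible sigma2 = n t (1 - eta)."""
    return math.exp(-9.0 * n * t * (1.0 - eta) * g(eta / (3.0 * (1.0 - eta))))

n, t, eta = 500, 0.1, 0.5            # hypothetical values
cap = n * t * (1.0 - eta)            # upper limit on sigma2 on the good event
sigmas = [cap * k / 20 for k in range(1, 21)]
values = [conditional_bound(s, n, t, eta) for s in sigmas]
print(values[0], values[-1], uniform_bound(n, t, eta))
```

The map $x \mapsto x\,g(c/x)$ equals $x + c - \sqrt{x^2 + 2cx}$, which is decreasing in $x$, so the exponential bound is increasing in $\sigma^2$ and is largest at the cap.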



Proof of Theorem 4.7. 

1. The proof is almost the same as the proof of part 1 of Theorem 4.2 and is therefore omitted. 

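2. Let $\mathcal{A}_{h,\tilde\epsilon}$ denote the event

$$\max\big\{ \|\hat p_{h,X} - p_h\|_\infty,\ |\lambda_{h,\alpha} - \hat\lambda_{h,\alpha,X}|,\ \|\hat p_{h,Y} - p_h\|_\infty,\ |\lambda_{h,\alpha} - \hat\lambda_{h,\alpha,Y}| \big\} \le \tilde\epsilon, \tag{52}$$

where $\tilde\epsilon = \epsilon(A\kappa_3 + 1) + A\kappa_3/n$. Then, using (8), (16) and the fact that $\epsilon < \tilde\epsilon$, the union bound yields

$$\mathbb{P}_{X,Y}(\mathcal{A}^c_{h,\tilde\epsilon}) \le 4K_1 e^{-K_2 n h^d \epsilon^2} + 16n e^{-n\epsilon^2/32} = C(h, \epsilon, n). \tag{53}$$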

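Now, on $\mathcal{A}_{h,\tilde\epsilon}$,

$$
\begin{aligned}
\{u : \hat p_{h,X}(u) > \hat\lambda_{h,\alpha,X},\ \hat p_{h,Y}(u) \le \hat\lambda_{h,\alpha,Y}\} &\subseteq \{u : p_h(u) > \hat\lambda_{h,\alpha,X} - \tilde\epsilon,\ p_h(u) < \hat\lambda_{h,\alpha,Y} + \tilde\epsilon\} \\
&\subseteq \{u : p_h(u) > \lambda_{h,\alpha} - 2\tilde\epsilon,\ p_h(u) < \lambda_{h,\alpha} + 2\tilde\epsilon\} \\
&= \{u : |p_h(u) - \lambda_{h,\alpha}| < 2\tilde\epsilon\} \\
&= U_{h,2\tilde\epsilon,\alpha},
\end{aligned}
$$

and therefore, $\mathrm{sign}(\hat p_{h,X}(u) - \hat\lambda_{h,\alpha,X}) = \mathrm{sign}(p_h(u) - \lambda_{h,\alpha})$ for all $u \notin U_{h,2\tilde\epsilon,\alpha}$. Next, just like in the proof of part 2 of Theorem 4.2, using this fact and the result of the first part we can write

$$
\begin{aligned}
\xi(h, \alpha) &\le \mathbb{E}_{X,Y}\Big[ P_Z\big( \{u : \hat p_{h,X}(u) > \hat\lambda_{h,\alpha,X}\} \Delta \{u : \hat p_{h,Y}(u) > \hat\lambda_{h,\alpha,Y}\} \big)\, 1_{\mathcal{A}_{h,\tilde\epsilon}}(X, Y) \Big] + \mathbb{P}_{X,Y}(\mathcal{A}^c_{h,\tilde\epsilon}) \\
&= 2 \int \mathbb{P}_{X,Y}\big( \{\hat p_{h,X}(u) > \hat\lambda_{h,\alpha,X},\ \hat p_{h,Y}(u) \le \hat\lambda_{h,\alpha,Y}\} \cap \mathcal{A}_{h,\tilde\epsilon} \big)\, dP(u) + \mathbb{P}_{X,Y}(\mathcal{A}^c_{h,\tilde\epsilon}) \\
&\le 2 \int_{U_{h,2\tilde\epsilon,\alpha}} \mathbb{P}_{X,Y}\big( \{\hat p_{h,X}(u) > \hat\lambda_{h,\alpha,X},\ \hat p_{h,Y}(u) \le \hat\lambda_{h,\alpha,Y}\} \cap \mathcal{A}_{h,\tilde\epsilon} \big)\, dP(u) + C(h, \epsilon, n) \\
&\le 2 \int_{U_{h,2\tilde\epsilon,\alpha}} \mathbb{P}_{X,Y}\big( \hat p_{h,X}(u) > \hat\lambda_{h,\alpha,X},\ \hat p_{h,Y}(u) \le \hat\lambda_{h,\alpha,Y} \big)\, dP(u) + C(h, \epsilon, n) \\
&\le 2 \int_{U_{h,2\tilde\epsilon,\alpha}} \pi_{h,\alpha}(u)(1 - \pi_{h,\alpha}(u))\, dP(u) + C(h, \epsilon, n) \\
&\le P(U_{h,2\tilde\epsilon,\alpha})\, A_{h,\tilde\epsilon,\alpha} + C(h, \epsilon, n).
\end{aligned}
$$

As for the lower bound, from the result of the first part we obtain, trivially,

$$\xi(h, \alpha) \ge 2 \int_{U_{h,2\tilde\epsilon,\alpha}} \pi_{h,\alpha}(u)(1 - \pi_{h,\alpha}(u))\, dP(u) \ge P(U_{h,2\tilde\epsilon,\alpha})\, \underline{A}_{h,\tilde\epsilon,\alpha}.$$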

3. To compute an upper bound for $A_{h,\tilde\epsilon,\alpha}$ and a lower bound for $\underline{A}_{h,\tilde\epsilon,\alpha}$, we use the Berry-Esseen bound and the stated assumptions. The proof is very similar to the proof of Lemma 4.4, except that the result holds only on the event $\mathcal{A}_{h,\tilde\epsilon}$. Therefore, we only provide a sketch of the arguments.

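The assumption bounding $\tilde\epsilon$ in terms of $\inf_h \lambda_{h,\alpha}$ implies that, for any $u \in U_{h,2\tilde\epsilon,\alpha}$,

$$\frac{1}{h^d} \frac{\delta \lambda_{h,\alpha}}{2 v_d} \le \frac{\delta(\lambda_{h,\alpha} - 2\tilde\epsilon)}{h^d v_d} \le \sigma^2(u, h) \le \frac{\lambda_{h,\alpha} + 2\tilde\epsilon}{h^d v_d} \le \frac{1}{h^d} \frac{3\lambda_{h,\alpha}}{2 v_d}.$$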

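Because of this and the fact that, on $\mathcal{A}_{h,\tilde\epsilon}$, $|p_h(u) - \hat\lambda_{h,\alpha,X}| < 3\tilde\epsilon$ for all $u \in U_{h,2\tilde\epsilon,\alpha}$, the same Berry-Esseen arguments used in the proof of Lemma 4.4 yield

$$1 - \Phi\left( \frac{3\tilde\epsilon\sqrt{nh^d}}{a_1} \right) - \frac{C(\delta, \lambda_{h,\alpha})}{\sqrt{nh^d}} \le \pi_{h,\alpha,\tilde\epsilon}(u) \le 1 - \Phi\left( -\frac{3\tilde\epsilon\sqrt{nh^d}}{a_2} \right) + \frac{C(\delta, \lambda_{h,\alpha})}{\sqrt{nh^d}},$$

where $\pi_{h,\alpha,\tilde\epsilon}(u) = \mathbb{P}_X(\{\hat p_{h,X}(u) > \hat\lambda_{h,\alpha,X}\} \cap \mathcal{A}_{h,\tilde\epsilon})$, $a_1 = \delta\lambda_{h,\alpha}/(2v_d)$, $a_2 = 3\lambda_{h,\alpha}/(2v_d)$, and $C(\delta, \lambda_{h,\alpha})$ is a constant depending on $\delta$, $v_d$ and $\lambda_{h,\alpha}$. Now notice that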



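$$\pi_{h,\alpha}(u) \ge \pi_{h,\alpha,\tilde\epsilon}(u) \ge 1 - \Phi\left( \frac{3\tilde\epsilon\sqrt{nh^d}}{a_1} \right) - \frac{C(\delta, \lambda_{h,\alpha})}{\sqrt{nh^d}}$$

and

$$\pi_{h,\alpha}(u) \le \pi_{h,\alpha,\tilde\epsilon}(u) + \mathbb{P}(\mathcal{A}^c_{h,\tilde\epsilon}) \le 1 - \Phi\left( -\frac{3\tilde\epsilon\sqrt{nh^d}}{a_2} \right) + \frac{C(\delta, \lambda_{h,\alpha})}{\sqrt{nh^d}} + C(h, \epsilon, n),$$

where $C(h, \epsilon, n)$ is defined in (53). Therefore,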

A hte , a = 2 sup ^»(l-n»)<2 1-f - 1 
Proof of Theorem 4.12. (1) Since the sample space is compact, $\mu(S) < \infty$, where $S$ denotes the support of $P$ and $\mu$ denotes the Lebesgue measure. Therefore, we obtain the inequality

$$
\begin{aligned}
\Gamma_n(h) &\le \frac{\mu(S)}{2} \|\hat p_{h,X} - \hat p_{h,Y}\|_\infty \le \frac{\mu(S)}{2} \|\hat p_{h,X} - p_h\|_\infty + \frac{\mu(S)}{2} \|\hat p_{h,Y} - p_h\|_\infty \\
&= \mu(S) \|\hat p_{h,X} - p_h\|_\infty.
\end{aligned}
$$

Next, let $C = \mu(S)\sqrt{(a + 2)/K_2}$, so that for $n \ge K_1$,

$$t_h \ge \sqrt{\frac{\mu(S)^2 \log(n^{a+1} K_1)}{K_2\, n h^d}}.$$



Then, 



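$$
\begin{aligned}
\mathbb{P}_{X,Y}(\Gamma_n(h) > t_h \text{ for some } h \in \mathcal{H}_n) &\le \mathbb{P}_X\left( \|\hat p_{h,X} - p_h\|_\infty > \frac{t_h}{\mu(S)} \text{ for some } h \in \mathcal{H}_n \right) \\
&\le \sum_{h \in \mathcal{H}_n} \mathbb{P}_X\left( \|\hat p_{h,X} - p_h\|_\infty > \frac{t_h}{\mu(S)} \right) \\
&\le \sum_{h \in \mathcal{H}_n} K_1 \exp\{-K_2\, n t_h^2 h^d / \mu(S)^2\} \\
&\le n^a \frac{1}{n^{a+1}} = \frac{1}{n},
\end{aligned}
$$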

where the third inequality stems from (8) and the assumption that n is large enough, and the last inequality 
follows from the assumed condition on 8. 
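The arithmetic behind the last two inequalities can be sketched in a few lines of Python. The example is illustrative only: the constants `K1`, `K2`, `mu_S`, the exponent `a`, and the bandwidth grid are hypothetical, and the grid is assumed to contain $n^a$ bandwidths. When $t_h$ equals the threshold $\sqrt{\mu(S)^2 \log(n^{a+1}K_1)/(K_2 n h^d)}$, each term $K_1 \exp\{-K_2 n t_h^2 h^d/\mu(S)^2\}$ equals $n^{-(a+1)}$, so summing over the grid gives $1/n$; any larger $t_h$ only makes the sum smaller.

```python
import math

def tail_term(n, h, d, t_h, mu_S, K1, K2):
    """Per-bandwidth bound K1 * exp{-K2 n t_h^2 h^d / mu(S)^2}."""
    return K1 * math.exp(-K2 * n * t_h ** 2 * h ** d / mu_S ** 2)

def smallest_threshold(n, h, d, a, mu_S, K1, K2):
    """sqrt(mu(S)^2 log(n^(a+1) K1) / (K2 n h^d)), the lower limit imposed on t_h."""
    return math.sqrt(mu_S ** 2 * math.log(n ** (a + 1) * K1) / (K2 * n * h ** d))

# Hypothetical constants and a bandwidth grid with n^a elements.
n, d, a, mu_S, K1, K2 = 200, 2, 1, 1.0, 3.0, 0.5
bandwidths = [0.05 + 0.001 * j for j in range(n ** a)]

total = sum(
    tail_term(n, h, d, smallest_threshold(n, h, d, a, mu_S, K1, K2), mu_S, K1, K2)
    for h in bandwidths
)
# Doubling t_h shows how quickly the union bound shrinks above the threshold.
total_larger = sum(
    tail_term(n, h, d, 2 * smallest_threshold(n, h, d, a, mu_S, K1, K2), mu_S, K1, K2)
    for h in bandwidths
)
print(total, total_larger, 1.0 / n)
```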

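(2) Consider any $h < h^*$. Note that

$$\Gamma_n(h) \ge \Gamma_{n,S}(h) = \frac{1}{2} \int_S |\hat p_{h,X}(u) - \hat p_{h,Y}(u)|\, du.$$

Let

$$D(u) = \sqrt{nh^d}\,(\hat p_{h,X}(u) - \hat p_{h,Y}(u)).$$

The variance of $D(u)$ is

$$
\begin{aligned}
\mathrm{Var}\left( \sqrt{nh^d}\,(\hat p_{h,X}(u) - \hat p_{h,Y}(u)) \right) &= nh^d \left( \mathrm{Var}(\hat p_{h,X}(u)) + \mathrm{Var}(\hat p_{h,Y}(u)) \right) \\
&= 2nh^d\, \mathrm{Var}(\hat p_{h,X}(u)) \\
&= 2nh^d\, \mathrm{Var}\left( \frac{1}{nh^d} \sum_{i=1}^n I(\|X_i - u\| \le h) \right) \\
&= \frac{2n^2 h^d}{n^2 h^{2d}}\, \mathrm{Var}\left( I(\|X_i - u\| \le h) \right) \\
&= \frac{2}{h^d}\, P(B(u, h))(1 - P(B(u, h))).
\end{aligned}
$$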

Now, for $u \in S$, by (33),

$$P(B(u, h))(1 - P(B(u, h))) \le P(B(u, h)) \le a_2 h^d v_d$$

and

$$P(B(u, h))(1 - P(B(u, h))) \ge P(B(u, h))\,\delta \ge a_1 h^d v_d\, \delta.$$

Hence,

$$2a_1 v_d \delta \le \mathrm{Var}(D(u)) \le 2a_2 v_d, \quad \forall u \in S,$$

which shows that the variance of D(u) is bounded above and below by positive functions that do not depend 
on h. By a similar calculation, Cov(D(u), D(v)) is bounded above and below by functions that do not depend 
on $h$, for all $u, v \in S$.
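The variance formula for $D(u)$ can be checked by simulation. The sketch below makes several assumptions for illustration: $d = 1$, data drawn from the Uniform(0, 1) distribution, and the estimator $\hat p_h(u) = (nh^d)^{-1} \sum_i I(\|X_i - u\| \le h)$ with an unnormalized ball kernel, as in the variance calculation. With $u = 0.5$ and $h = 0.1$ the ball probability is $P(B(u,h)) = 2h$, so the formula $\frac{2}{h^d} P(B(u,h))(1 - P(B(u,h)))$ can be compared with a Monte Carlo estimate.

```python
import math
import random

def kde_indicator(sample, u, h):
    """(1/(n h)) * #{i : |X_i - u| <= h}: ball-kernel estimate in dimension d = 1."""
    return sum(abs(x - u) <= h for x in sample) / (len(sample) * h)

def simulate_D(n, u, h, reps, seed=0):
    """Draws of D(u) = sqrt(n h) * (p_hat_X(u) - p_hat_Y(u)) for Uniform(0, 1) samples."""
    rng = random.Random(seed)
    out = []
    for _ in range(reps):
        xs = [rng.random() for _ in range(n)]  # sample X
        ys = [rng.random() for _ in range(n)]  # independent sample Y
        out.append(math.sqrt(n * h) * (kde_indicator(xs, u, h) - kde_indicator(ys, u, h)))
    return out

n, u, h = 100, 0.5, 0.1
draws = simulate_D(n, u, h, reps=4000)
mean_D = sum(draws) / len(draws)
var_mc = sum((x - mean_D) ** 2 for x in draws) / len(draws)

ball_prob = 2 * h                                     # P(B(u, h)) under Uniform(0, 1)
var_formula = (2 / h) * ball_prob * (1 - ball_prob)   # (2 / h^d) P(B)(1 - P(B))
print(mean_D, var_mc, var_formula)
```

Note that the formula does not involve $n$, in line with the observation that the variance bounds are free of the sample size and, after applying (33), of $h$.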
Now, for any $u$,

$$D(u) = D_1(u) - D_2(u) = \sqrt{nh^d}\,(P_n - P)(f_u) - \sqrt{nh^d}\,(Q_n - P)(f_u),$$

where $P_n$ is the empirical measure based on $X_1, \ldots, X_n$, $Q_n$ is the empirical measure based on $Y_1, \ldots, Y_n$, and $f_u(\cdot) = h^{-d} K(\|u - \cdot\|/h)$. Note that $D_1$ and $D_2$ are independent, mean 0 stochastic processes. We can regard $\{\sqrt{nh^d}\,(P_n - P)(f) : f \in \mathcal{F}\}$ as an empirical process, where $\mathcal{F} = \{f_u : u \in S\}$, and similarly for $\{\sqrt{nh^d}\,(Q_n - P)(f) : f \in \mathcal{F}\}$. For fixed $h$, the collection $\mathcal{F}$ is a Donsker class. Hence, for every $u \in S$, $D_1(u)$ and $D_2(u)$ converge to two independent mean 0 Gaussian processes. By the continuous mapping theorem, for every $u \in S$, $D(u)$ converges to a mean 0 Gaussian process $G$ with some covariance kernel $k$. By the calculations above, there exist positive bounded functions $r(u, v) \le s(u, v)$ such that $r(u, v) \le k(u, v) \le s(u, v)$ and such that neither $r$ nor $s$ depends on $h$. Hence




where the last probability is the law of the Gaussian process $G$. Since $G$ has strictly positive variance, $\mathbb{P}(\int |G| > 0) = 1$. Clearly, $\mathbb{P}(\int |G| > 2t)$ is decreasing in $t$. Hence, for each $\delta$, there is a positive $t$ such that

$$\mathbb{P}\left( \int |G| > t \right) > 1 - \delta/2.$$

(3) The proof of this part is straightforward and is omitted. 